Abstract
Non-small cell lung cancer (NSCLC) is histologically classified into
lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LSCC).
However, some tumors are histologically ambiguous and other
pathophysiological features or microenvironmental factors may be more
prominent. Here we report integrative multiomics analyses using data
for 229 patients from a Korean NSCLC cohort and 462 patients from
previous multiomics studies. Histological examination reveals five
molecular subtypes, one of which is a NSCLC subtype with PI3K-Akt
pathway upregulation, showing a high proportion of metastasis and poor
survival outcomes regardless of any specific NSCLC histology.
Proliferative subtypes are present in LUAD and LSCC, which show strong
associations with whole genome doubling (WGD) events. Comprehensive
characterization of the immune microenvironment reveals various immune
cell compositions and neoantigen loads across molecular subtypes, which
predicting different prognoses. Immunological subtypes exhibit a hot
tumor-enriched state and a higher efficacy of adjuvant therapy.
Subject terms: Non-small-cell lung cancer, Cancer genomics, Proteomics,
Immunogenetics
__________________________________________________________________
Subtyping of non-small cell lung cancer can be challenging based on
pathology. Here, the authors utilise multi-omics analysis of 229
patients to identify further subtypes, and altered immune composition
between subtypes.
Introduction
Lung cancer is a major health concern worldwide, accounting for 18% of
global cancer-related deaths^[78]1. Surgical resection, with or without
postoperative adjuvant therapy, is currently the first-line treatment
for locally advanced or early-stage lung cancers^[79]2. In advanced
cases, genomics-based targeted therapies and immunotherapies using
immune checkpoint inhibitors provide promising treatment options^[80]3,
often in addition to chemotherapy. Although the mortality rates
associated with lung cancer are continuously decreasing with
appropriate surgical and medical treatment^[81]4, the 5-year survival
rates for localized, regional, and metastatic non-small cell lung
cancer (NSCLC) remain unsatisfactory (64%, 37%, and 8%, respectively),
mainly due to recurrence after treatment^[82]5. Therefore, tailoring
treatments based on patient stratification according to molecular
characteristics is of increasing interest to improve patient survival.
High-throughput omics approaches have facilitated the classification of
NSCLC molecular subtypes and candidate molecular targets as well as
prognostic or predictive biomarkers for lung cancer treatment^[83]6.
Initially, The Cancer Genome Atlas Research Network identified the
molecular pathways of lung cancer and classified lung squamous cell
carcinoma (LSCC) into classical, basal, secretory, and primitive
subtypes^[84]7, and lung adenocarcinoma (LUAD) into terminal
respiratory unit, proximal inflammatory, and proximal proliferative
subtypes^[85]8. Recently, proteomic and proteogenomic
approaches^[86]9–[87]14 have been introduced to identify novel subtypes
and their druggable targets.
Although these efforts have advanced our understanding of cancer
biology, the clinical implementation of these findings in
high-throughput approaches remains challenging^[88]15,[89]16. First,
many proteogenomic studies have focused on a single subtype of NSCLC
(such as LUAD or LSCC), but the approach excludes cases in which the
pathological classification is ambiguous and/or
discrepant^[90]17,[91]18, which eventually leads to the formation of
study cohort that may not be representative of real NSCLC population.
Second, the sensitive prediction of postoperative survival beyond those
achieved using conventional prognostic factors is a prerequisite for
personalized surgical oncology treatment. Further characterization of
causative biomarkers associated with prognosis may identify crucial
treatment targets^[92]19, and requires a multiomics-based analysis of
well-annotated clinical data. Finally, most multiomics data have been
generated from resected lung cancer tissues, and adjuvant chemotherapy
after surgery is the primary treatment for locally advanced lung
cancer. Adjuvant chemotherapy largely relies on platinum doublet and
platinum-based chemotherapeutic regimens^[93]20 in addition to a
cytotoxic agent (e.g., pemetrexed, gemcitabine, vinorelbine, or
paclitaxel). Although this approach has improved patient
survival^[94]21, few studies^[95]22 have investigated the molecular
basis of the response to postoperative adjuvant treatment, and
consensus biomarkers for predicting treatment efficacy have not yet
been identified.
In this study, we conducted a comprehensive multiomics analysis to
define the molecular subtypes of NSCLC using a Korean NSCLC discovery
cohort of 229 patients and a replication cohort of 462 NSCLC patients
from previous multiomics studies. Our aim was to expand the scope of
histological subtyping and identify molecular subtypes of NSCLC with
potential prognostic and therapeutic implications. We further
characterized the subtypes by integrating a large-scale single-cell RNA
sequencing dataset^[96]23 of NSCLC and evaluated their cellular
specificity and histopathological relevance. In addition, our study
included an extensive histological review of patient data for
tumor-infiltrating lymphocytes (TILs), identified potential neoantigens
and cryptic peptides characterizing the immune microenvironment, and
noted the varying efficacy of adjuvant therapies between subtypes.
Overall, our study represents a considerable advance in the field of
NSCLC research and has important implications for precision medicine
and personalized therapies.
Results
Identification of subtypes in NSCLC patients by multiomics
The multiomics analysis utilized a retrospective cohort of 229 Korean
patients diagnosed with NSCLC at Asan Medical Center in Seoul, Korea
(Supplementary Data [97]1a). Tumor samples with matched normal tissues
or blood samples were collected via surgery between 2010 and 2019.
Patient demographics, disease parameters (including histology and
tumor-node-metastasis [TNM] staging), survival, and treatment response
indicated good coverage of disease severity and patient population.
Histologically, the tumor samples included 139 adenocarcinomas (LUADs,
61%), 63 squamous cell lung carcinomas (LSCCs, 27%), and 27 tumors of
other types (12%) (Fig. [98]1a, Supplementary Data [99]1a). LUAD cases
were almost equally distributed between the sexes (66 males and 74
females), whereas LSCC cases were found mostly among males (97%,
n = 61), which was previously shown in other Korean LSCC
cohort^[100]24. Self-reported smoking status indicated that 61% of the
patients had a smoking history, with a higher prevalence of smoking in
males and patients with LSCC. The TNM-based stages ranged from IA1 to
IVA, with approximately 40% of the patients having late-stage disease
(IIIA, n = 75; IIIB, n = 9; IVA, n = 11; Fig. [101]1a). Approximately
half of the patients had lymph node metastases (51%, n = 116) at the
time of pathological diagnosis after surgery. Adjuvant therapy,
including chemotherapy (CTx) or radiation therapy (RTx), was
administered to 48% of the patients (110/229 patients; 59 CTx, 16 RTx,
and 35 CTx and RTx) according to the NCCN guidelines. Tumor recurrence
was observed in 54% of the patients who received adjuvant therapy, and
the recurrence rate was similar across treatments (CTx, 56%, 33/59;
RTx, 50%, 8/16; CTx and RTx, 54%, 19/35) (Fig. [102]1a) and
histological diagnoses (53%, AD, 72/137; 44%, SC, 27/62) (Supplementary
Fig. [103]1a).
Fig. 1. Identification of multiomics subtypes in Korean NSCLC patients.
[104]Fig. 1
[105]Open in a new tab
a Summary of clinical information for the Korean NSCLC patient cohort.
Bar plots show NSCLC histology, tumor stages, pathologic N status, and
tumor recurrence status with adjuvant treatment. b Numbers of
quantified proteins and PTM sites identified in global proteomic,
phosphoproteomic, and acetylproteomic analyses. The number of features
quantified at <30% missing values across 229 samples is represented by
the yellow line. c Overview of NMF clustering. Other tumor suppressor
genes consisted of CDKN2A, STK11, KEAP1, RB1, PPP2R1A, and SMARCA4.
Other oncogene alterations consisted of frameshift deletions; in-frame
deletion/insertion and missense mutations in KRAS, ERBB2, and PIK3CA;
exon skipping in MET; and gene fusion in ALK, ROS1, and RET. Copy
number loss was defined as homozygous deletion (absolute copy number
<0.5). d Enrichment of the five identified NMF clusters for clinical
variables. Tests indicating statistical significance (P < 0.05,
two-sided Fisher’s exact test) are colored according to the odds ratio
(OR). Box size indicates the proportion of the cohort characterized by
a given clinical variable. e Pathway enrichment analyses of the five
subtypes using GSVA and PTM-SEA. Pathways with statistical significance
(FDR < 0.05, permutation) and positive enrichment scores (z-score) are
represented by dots.
We generated genomic, transcriptomic, proteomic, phosphoproteomic, and
acetylproteomic datasets from the samples. A genomic dataset was
generated by whole-exome sequencing (WES) of 228 normal adjacent tissue
(NAT)-matched tumors and one tumor-only sample with a read depth
sufficient for variant discovery (tumor: ~300X, NAT: ~100X). We
observed 33,301 somatic small mutations that contained single
nucleotide variants (SNVs) and indels (on average, 145 per sample),
470,836 copy-number alterations encompassing amplification, gain,
heterozygous deletion, and homozygous deletion (on average, 2056 per
sample) and a 2.7 tumor mutation burden (TMB) score in 229 tumors. For
transcriptomic analysis, we performed bulk RNA-seq for 205 tumors and
85 matched NATs for deep coverage (approximately 120 M reads per
sample), enabling gene expression quantification and alternative
splicing isoform discovery. We acquired 60,688 transcripts and selected
20,088 transcripts, based on low-count genes across samples, for
subsequent analysis. For proteomic analysis using tandem mass tag
(TMT)-based isobaric labeling, proteomic data were collected from 229
tumor samples and 26 matched NATs. A total of 10,788 proteins, 40,738
phosphosites, and 5975 acetylation sites were observed in at least 30%
of the samples and quantified as a log[2] ratio to the common reference
(CR) sample (Fig. [106]1b).
For multiomics analysis, we integrated proteomic, phosphoproteomic, and
acetylproteomic data and conducted non-negative matrix factorization
(NMF) clustering to identify multiomics subtypes in the 229 NSCLC
samples. We identified five multiomics subtypes: metabolic (Subtype 1),
alveolar-like (Subtype 2), proliferative (Subtype 3), hypoxic (Subtype
4), and immunogenic (Subtype 5) (Fig. [107]1c), characterized based on
genetic mutations, clinical phenotypes, and molecular pathways
(Fig. [108]1d, e, Supplementary Data 1b–d).
Subtype 1 was composed mainly of LUAD females (64%, 35/55) with EGFR
and TP53 mutations, as well as a high frequency of whole genome
doubling (WGD) events (i.e., phenomena in which more than half of the
chromosomes are gained; Fig. [109]1c), suggesting a chromosomally
unstable co-driven subtype. Significant enrichment of CDKN2A copy
number loss in Subtype 1 (OR: 3.63, P = 2.35 × 10−^2, Fisher’s exact
test) also supports the observation (Fig. [110]1d). Subtype 2 mainly
comprised patients with LUAD (71%, 32/45) with EGFR mutations (49%,
22/45) and without WGD events (Fig. [111]1c), suggesting a
chromosomally stable oncogene-driven subtype. These samples showed a
significantly lower frequency of TP53 mutations (6%; odds ratio [OR]:
0.04, P = 3.1 × 10^−9, Fisher’s exact test; Fig. [112]1d) and much
lower tumor mutational burden (TMB) than samples representing other
subtypes (Subtype 2 = 0.9 variants per Mb, others = 3.2 variants per
Mb; P = 3.6 × 10^−10, Wilcoxon ranked sum test) (Supplementary
Fig. [113]1b, c). Despite the upregulation of the EGFR pathway in both
subtypes, molecular pathways were specifically enriched for each
cluster. Subtype 1 exhibited significant upregulation of proteins
involved in oxidative phosphorylation (adjusted P = 1.7 × 10^−6,
Wilcoxon rank-sum test), mitochondrial matrix (adjusted P = 8.1 ×
10^−9, Wilcoxon rank-sum test), and cellular respiration (adjusted
P = 4.2 × 10^−5, Wilcoxon rank-sum test), indicating an association
with metabolic pathways. Conversely, Subtype 2 was characterized by a
significant upregulation of phosphorylation in the IL-33 (adjusted
P = 1.8 × 10^−15, Wilcoxon rank-sum test) and Notch pathways (adjusted
P = 2.4 × 10^−16, Wilcoxon rank-sum test) (Fig. [114]1e), implicating
these molecular features in early tumorigenesis and early-stage
disease.
Subtype 3 was significantly associated with WGD events (OR: 13.5,
P = 4.6 × 10^−10, Fisher’s exact test), TP53 (OR: 13.5, P = 4.6 ×
10^−10, Fisher’s exact test) and PIK3CA mutations (OR: 3.89, P = 3.16 ×
10−^2, Fisher’s exact test), and was more prevalent in patients with
LSCC (65%, 34/52 patients), in males (83%, 43/52 patients), and smokers
(85%, 44/52 patients) (Fig. [115]1d). Subtype 3 exhibited a highly
proliferative phenotype, as evidenced by significant enrichment in cell
cycle-related pathways, including E2F/MYC target (adjusted P = 4.7 ×
10^−22, 1.3 × 10^−20, Wilcoxon rank-sum test), G2M checkpoint (adjusted
P = 2.4 × 10^−23, Wilcoxon rank-sum test), and cyclin-dependent kinase
(CDK) target pathways (adjusted P = 5.3 × 10^−24, Wilcoxon rank-sum
test) (Fig. [116]1e). Thus, this subtype can be defined as a
chromosomally unstable tumor suppressor-deficient proliferative
subtype.
Subtype 4 was not associated with any specific histological type of
NSCLC but was significantly enriched for metastasis (OR: 3.0, P = 5.6 ×
10^−3, Fisher’s exact test). We found that phosphorylated sites in this
subtype were upregulated in hypoxia (adjusted P = 2.3 × 10^−4, Wilcoxon
rank-sum test), PI3K-Akt (adjusted P = 4.7 × 10^−3, Wilcoxon rank-sum
test), and neutrophil degranulation (adjusted P = 1.7 × 10^−4, Wilcoxon
rank-sum test) pathways, and proteins were also enriched for neutrophil
degranulation (adjusted P = 4.6 × 10^−6, Wilcoxon rank-sum test),
suggesting a potential role in the promotion of tumor migration,
invasion, and metabolism in tumor metastasis. Therefore, Subtype 4 can
be considered as a chromosomally stable mesenchymal subtype.
Subtype 5 showed a significantly elevated proportion of
tumor-infiltrating lymphocyte (TIL)-associated patterns (OR: 3.6,
P = 1.2 × 10^−2, Fisher’s exact test; Fig. [117]1d) and enrichment of
immune-related pathways, such as TNFα signaling via NF-κB (adjusted
P = 1.1 × 10^−6, Wilcoxon rank-sum test, Fig. [118]1e), suggesting that
this subtype was a high-immune and chromosomally stable
tumor-suppressor-driven inflammatory subtype. KRAS mutation was
significantly enriched in Subtype 5 (OR: 4.51, P = 3.53 × 10^−2,
Fisher’s exact test) (Supplementary Fig. [119]1d), with four cases
presenting concurrent TP53 mutation. STK11 and KEAP1 mutations were
each identified only in 1 case, although the trends were not
statistically significant.
A NSCLC subtype associated with poor prognosis and frequent metastasis
To replicate our subtype classification, we utilized multiomics or
proteomics data from 462 patients with NSCLC obtained from previous
studies, including two LUAD studies by Gillette et al.^[120]10
(n = 110) and Xu et al.^[121]11 (n = 103), an LSCC study by Satpathy et
al.^[122]13 (n = 108), and an NSCLC study by Lehtio et al.^[123]14
(n = 141). We compared the top features of the NMF subtypes between our
cohort and those of other studies and found significant overlaps
(FDR < 0.01, Fisher’s exact test): the terminal respiratory unit
subtype of adenocarcinoma (Subtype 1), inflammatory subtypes (Subtypes
2 and 5), and proliferative subtypes of LUAD and LSCC (Subtype 3)
(Fig. [124]2a and Supplementary Data [125]2a). In contrast, Subtype 4
was distinct, showing enrichment for phosphorylation features
associated with the EMT-enriched (LSCC)^[126]13 subtype and acetylation
features from the inflammatory subtypes of LUAD and LSCC (Supplementary
Fig. [127]2a, b). We also performed a combined NMF analysis for a total
of 447 patients with NSCLC by integrating proteome, phosphoproteome,
and acetylome datasets from our study with those from previous
studies^[128]10,[129]13 (hereafter called the “combined CPTAC
dataset”). We identified five subtypes, referred to as “Combined NMF”
(Fig. [130]2b). The combined NMF showed a highly consistent pattern
with the feature overlap analysis (Fig. [131]2a and Supplementary
Data [132]2a) and confirmed that the four subtypes (Subtype 1, 2, 3, 5)
showed a consistency with the previously identified NMF subtypes
(Fig. [133]2b, Supplementary Fig. [134]2c and Supplementary
Data [135]2b). Subtype 4 did not cluster with any single subtype of
LUAD or LSCC in the combined NMF analysis (Fig. [136]2b). A comparison
without the Korean cohort (“CPTAC NSCLC”), constituting 51% (229/447)
of the Combined NMF, showed 77% of patients maintained their subtype
classification when including the Korean cohort (Fig. [137]2c).
Additionally, Subtype 4 consistently comprised 18.8% (43/229) of the
Korean NSCLC cohort and 18.3% (40/218) of the combined CPTAC dataset
without histological types. Collectively, our findings suggest that
Subtype 4 represents a NSCLC subtype that requires further clinical and
molecular characterization.
Fig. 2. Novel NSCLC subtype associated with poor survival.
[138]Fig. 2
[139]Open in a new tab
a Overlap of subtype features between the five NMF subtypes in this
study and subtypes identified in previous NSCLC multiomics studies.
Protein enrichment is in the heatmap. Full rectangle and asterisk
indicate significant overlaps (Two-sided fisher’s exact test adjusted P
≤ 0.05, Benjamini-Hochberg adjustment), faint rectangle indicates
overlaps which pass only nominal P value (Two-sided fisher’s exact test
P ≤ 0.05, two-sided fisher’s exact test adjusted P > 0.05), and blank
indicates overlaps which is not significant (two-sided fisher’s exact
test P > 0.05). b, c Reclassification of samples from previously
defined multiomics subtypes^[140]10,[141]13 according to our combined
NMF subtypes. The statistical significance of the relationship is
visually represented by the clarity and transparency of the lines
(Supplementary Fig. [142]2c). d Subtype 4-specific kinase activity
scores estimated from phosphoproteomic data and the kinase-substrate
network database (PHONEMeS). The colors of the points represent the
estimated kinase activity scores. The sizes of the points represent the
-log[10](FDR) of the kinase activity estimates. There were two
significantly upregulated kinases: CSNK2A1 and GSK3B (FDR < 0.05). e
Expression of poor prognosis markers containing phosphorylated sites on
SLK (S347) and LRRFIP1 (S581) is shown for our study (Subtype 4,
n = 43; others, n = 186) and CPTAC (LUAD (Subtype 4, n = 26; others,
n = 84), LSCC (Subtype 4, n = 15; others, n = 93)). Wilcoxon rank-sum
test was performed to test the differences in expression. The color of
the dots in the right panel represents the study type in CPTAC. For
box-plots, middle line, median: box edges, 25th and 75th percentiles;
whiskers, most extreme points that do not exceed ±1.5 × IQR. f
Cancer-specific overall survival length according to the expression of
poor prognosis markers in our study and the CPTAC dataset (integrated
with LUAD and LSCC). The p-value was calculated with the log-rank test.
g Intracellular signaling pathways underlying poor prognosis in Subtype
4. The blue box represents the main signaling pathways, including the
HIF-1, VEGF, PI3K-AKT, and NF-κB signaling pathways. The red triangular
nodes are kinases identified as significantly upregulated in Subtype 4.
The colors of the points represent the log[2]FC values obtained through
differential expression (DE) analysis of Subtype 4 and the other
subtypes. The border style of the point indicates the prognostic
direction of the feature. h Cancer-specific overall survival length
between our subtypes indicating significant changes in survival
probability (y-axis) over time (x-axis). i Survival curves for patients
with (n = 35) and without metastasis (n = 8) in Subtype 4 (n = 43) and
( j) patients without metastasis in each subtype (n = 91). The p-value
was calculated with the log-rank test (h–j).
To examine the molecular characteristics of Subtype 4, we extracted
distinct features (proteins, phosphorylation, or acetylation) of the
subtypes in our NMF clustering analysis (Supplementary Data [143]1e).
We found that the majority of NMF features present in Subtype 4 were
phosphorylated sites (96%, 178/186, Supplementary Data [144]1e),
indicating that phospho-kinase interactions are a major signature.
Therefore, we investigated the kinase activity of this subtype using
phosphoproteomic data. We found significant enrichment of two kinases,
CSNK2A1 (FDR = 2.3 × 10^−7) and GSK3B (FDR = 1.9 × 10^−2), which are
known to phosphorylate various proteins in the PI3K-AKT signaling
pathway, in Subtype 4 compared to other subtypes (Fig. [145]2d,
Supplementary Fig. [146]2d and Supplementary Data [147]2c). Upon
evaluating the relationship between the activity and expression levels
of significant kinases (P < 0.05), we observed a moderate correlation
(Supplementary Fig. [148]2e). Survival analysis based on feature
expression (Supplementary Data [149]2d) showed that most of the
unfavorable prognostic factors were phosphorylated sites differentially
expressed in Subtype 4 (91%, 104/114) (Supplementary Fig. [150]2f).
Notably, STE20-like serine/threonine-protein kinase at serine 347 (SLK
(S347)), a protein phosphorylated by CSNK2A1, was significantly
upregulated in Subtype 4 (adjusted P = 8.0 × 10^−3, Benjamini–Hochberg
adjustment, Supplementary Data [151]3a) and associated with unfavorable
prognostic features (P = 3.0 × 10^−6, log-rank test) (Fig. [152]2e, f).
In the combined CPTAC dataset, we also found increased phosphorylation
of SLK (S347) in Subtype 4 and poor survival outcomes (Fig. [153]2e, f,
Supplementary Fig. [154]2g and Supplementary Data [155]2e). SLK
mediates apoptosis downstream of the ErbB2 and PI3K pathways^[156]25
and is activated in a CSNK2A1-dependent manner^[157]26. Recent studies
have reported that high SLK expression is associated with reduced
overall survival in HER2-positive patients^[158]27 and in
glioma^[159]28. Using both our cohort and the CPTAC cohort to check the
ROC curve, we found that SLK (S347) is an effective marker for
distinguishing Subtype 4 not only in our cohort but also more
effectively in the CPTAC cohort (Supplementary Fig. [160]2h).
Collectively, these results suggest that SLK participates in tumor
progression and could be a key marker specific to Subtype 4.
Furthermore, Subtype 4 showed a significant upregulation of
leucine-rich repeat flightless-interacting protein 1 at serine 581
(LRRFIP1 (S581)) phosphorylation (adjusted P = 1.3 × 10^−2,
Benjamini–Hochberg adjustment, Supplementary Data [161]3a), which was
correlated with unfavorable prognostic features (P = 1.2 × 10^−3,
log-rank test) (Fig. [162]2e, f). Similarly, in the combined CPTAC
dataset, elevated phosphorylation of LRRFIP1 (S581) was observed in
Subtype 4, which coincided with poor survival outcomes (Fig. [163]2e,
f, Supplementary Fig. [164]2g and Supplementary Data [165]2e). Notably,
LRRFIP1 stimulates the epithelial–mesenchymal transition (EMT) pathway
by modulating the Wnt/β-catenin signaling pathway^[166]29. A recent
study^[167]30 demonstrated that high LRRFIP1 expression was associated
with reduced overall survival in glioma. These findings collectively
suggest that LRRFIP1 may also contribute to cellular invasion and
metastasis and could serve as a key marker specific to Subtype 4.
Subtype 4 included numerous prognostic features in the HIF-1 and
PI3K-AKT signaling pathways (Fig. [168]2g and Supplementary
Fig. [169]2i). Among these unfavorable features, the significantly
upregulated ANGPT2 protein is known to increase in a hypoxic
environment, which can promote the release of pro-angiogenic cytokines,
such as VEGF, through HIF-1 accumulation. We also observed increased
acetylation of TIMP3 at K65, which is presumably involved in the
ANGPT2-induced hypertensive response. In contrast, CDKN1B, which was
significantly downregulated (adjusted P = 7.1 × 10^−11, two-sided
t-test, Supplementary Data [170]3a), was a favorable prognostic feature
in our cohort (P = 1.3 × 10^−4, log-rank test, Supplementary
Data [171]2d) and the combined CPTAC dataset (P = 1.6 × 10^−2, log-rank
test). Given that the loss of CDKN1B increases the risk of cancer
metastasis^[172]31, the reduced expression of CDKN1B implies an
advanced phenotype of Subtype 4.
We evaluated the clinical significance of Subtype 4 according to the
survival rates of patients with this subtype. Subtype 4 in the Korean
NSCLC cohort was associated with a significantly poorer survival rate
compared to that in the other subtypes (P = 8.4 × 10^−3, log-rank test,
Fig. [173]2h), possibly influenced by stage distribution. Since Subtype
4 showed the highest rate of metastasis among the five subtypes,
metastasis may have an important impact on the poor survival outcomes.
Interestingly, there was no variation in the survival rates of patients
with and without metastasis in Subtype 4 (P = 9.5 × 10^−1, log-rank
test, Fig. [174]2i), suggesting that metastasis may not be the sole
mechanism leading to poorer survival. When comparing survival rates
among non-metastatic patients of the five subtypes, we observed a
significant disadvantage for patients with Subtype 4 (P = 2.6× 10^−2,
log-rank test, Fig. [175]2j), indicating that other biological factors
may contribute to the poor prognosis of this subtype. In contrast, we
did not find any significant difference in overall survival between
Subtype 4 and the other subtypes in the combined CPTAC cohort (P = 7.9
× 10^−1, log-rank test, Supplementary Fig. [176]2j), indicating the
effect of differences in ethnicity and treatment records.
Cellular landscape of the five subtypes of NSCLC
Exploring the tumor microenvironment is crucial for understanding the
mechanisms underlying cancer progression and for developing effective
therapeutic strategies to target not only cancer cells but also the
surrounding microenvironment. We assessed the tumor microenvironment of
the five NSCLC subtypes based on cell type specificity through a
comparative analysis of subtype-specific genes with a range of cell
type-specific genes using an integrated single-cell RNA (scRNA)
sequencing dataset^[177]23 of NSCLC patients. First, we computed the
differentially expressed genes (DEGs) for each subtype compared to the
NAT samples (Supplementary Data [178]3b). As expected, we found that
the DEG sets of all five subtypes were significantly upregulated in
tumor cell types (FDR < 0.001, permutation; Supplementary Fig. [179]3a
and Supplementary Data [180]3c), highlighting the overall degree of
tumorigenicity of the subtypes.
Next, we identified DEGs for each subtype by comparing all subtypes
(DEG[subtype], e.g., Subtype 1 vs. other subtypes) (Supplementary
Data [181]3d). In Subtypes 1 and 3, the DEG[subtype] was enriched for
cell types related to tumors (FDR < 0.01, permutation): Subtype 1
corresponding to LUAD and LUAD mitotic cell types and Subtype 3
corresponding to LSCC and LSCC mitotic cell types as defined^[182]23,
suggesting actively proliferating tumor cell components (Fig. [183]3a,
Supplementary Fig. [184]3b and Supplementary Data [185]3e). In
contrast, the DEG[subtype] sets of Subtypes 2, 4, and 5 were enriched
in neutrophils. While a high proportion of NAT-associated neutrophils
(NANs) was present in these subtypes (Supplementary Data [186]3e), the
tumor-associated neutrophil (TAN) cell types were enriched in Subtypes
4 and 5, with the largest proportion in Subtype 4. Furthermore, the
DEG[subtype] set of Subtype 2 appeared to be enriched for alveolar-type
fibroblasts and endothelial cells, implying that these tumors have
stromal components similar to those of the normal alveolar interstitium
(Supplementary Data [187]3e). Most immune-related cell types were
enriched in the DEG[subtype] set of Subtype 5 (Supplementary
Data [188]3e), reflecting a high proportion of TIL patterns.
Fig. 3. Landscape of cell type-specific subtype characteristics.
[189]Fig. 3
[190]Open in a new tab
a UMAP plot of single-cell types specific to Subtypes 1 to 5, using
each DEG[subtype]. The color of each point represents the module score
of each cell; the more relevant the module is to the cell type, the
higher the score and the redder the color. UMAP information was
obtained from the original study^[191]23. b Representative histologic
images of the subtypes. The tumor cell (T) and stromal (S) components
are separately labeled. Note the irregularly fused tumor glands in
subtype 1 tumor compared to those in subtype 2 tumors composed of
small, uniform tumor cells lying within the elastic stroma similar to
the normal alveolar wall. Dense stromal inflammatory cell infiltration
in Subtype 5 tumors. Scale bars for b = 100 μm. c Proportions of
samples with different pathologic diagnoses within each subtype. LUAD
was predominant in Subtypes 1 and 2, whereas LSCC was predominant in
Subtype 3. d–f Histologic patterns of LUADs in each subtype. The
predominant patterns of Subtype 1 and 2 LUADs were most commonly acinar
or papillary but were quite heterogeneous (d). The proportion of the
lepidic pattern, considered to indicate noninvasive LUAD, was enriched
mostly in Subtype 2 LUADs, suggesting that Subtype 2 is most like early
LUAD (Subtype 1, n = 55; Subtype 2, n = 34, Subtype 3, n = 10; Subtype
4, n = 26; Subtype 5, n = 20) (e). Consistently, the proportion of
high-grade histologic patterns (including solid, micropapillary,
cribriform, and complex glandular patterns) was lowest in Subtype 2
LUADs. The high-grade histologic pattern was more extensive in Subtype
1 than in Subtype 2, but these subtypes were remarkably heterogeneous
compared to Subtype 3–5 LUADs, which were mostly composed of high-grade
histologic patterns (Subtype 1, n = 55; Subtype 2, n = 34, Subtype 3,
n = 10; Subtype 4, n = 24; Subtype 5, n = 20) (f). For box-plots,
middle line, median: box edges, 25th and 75th percentiles; whiskers,
most extreme points that do not exceed ±1.5 × IQR. g–I Lymphovascular
invasion (g), lymph node metastasis (h), and tumor necrosis (i) were
less common in Subtype 2 tumors, which also implies that Subtype 2
tumors are in a clinically early, nonprogressed stage. j
Microscopically, the stromal component was more extensive in Subtype 2,
4, and 5 tumors (Subtype 1, n = 55; Subtype 2, n = 43, Subtype 3,
n = 52; Subtype 4, n = 43; Subtype 5, n = 34). For box-plots, middle
line, median: box edges, 25th and 75th percentiles; whiskers, most
extreme points that do not exceed ±1.5 × IQR. k–l Tumor-infiltrating
lymphocytes (k) and stromal neutrophilic infiltration (l) were most
extensive in Subtype 5 tumors. The p-value was calculated using the
chi-square test (c, d, g–I, k, and l) the Kruskal-Wallis test (e, f,
and j). m Summary of the histopathologic characteristics of the NSCLC
subtypes.
We correlated these findings with the histopathological features of the
subtypes (Fig. [192]3b–m). Subtypes 1 and 2 were predominantly
well-to-moderately differentiated adenocarcinomas, in which acinar and
papillary patterns were predominant (Fig. [193]3c, d). Comparisons of
the histological patterns for LUAD cases showed that the lepidic
pattern, reminiscent of the normal alveolar structure of the lung
parenchyma and a non-invasive component of lung adenocarcinoma^[194]32,
was more extensive in Subtype 2 (Fig. [195]3e), confirming the
DEG[subtype] findings. In contrast, Subtype 1 had a higher proportion
of high-grade histologic patterns, including solid, micropapillary,
cribriform, and complex glandular patterns^[196]33 compared to those in
Subtype 2 (Fig. [197]3f). Consistent with these observations, Subtype 2
tumors exhibited less frequent lymphovascular invasion (Fig. [198]3g)
and lymph node metastasis (Fig. [199]3h). Subtype 3 mainly consisted of
squamous cell carcinomas and a subset of solid-predominant
adenocarcinomas (Fig. [200]3c, d). Subtype 3 tumors had enlarged
pleomorphic nuclei with high mitotic activity and frequent tumor
necrosis (Fig. [201]3i), which was consistent with the DEG[subtype]
analysis. In contrast to Subtypes 4 and 5, the stromal components of
Subtype 3 tumors were relatively scarce (Fig. [202]3b–j). Subtype 4
tumors had moderate-to-large amounts of desmoplastic stromal components
with variable amounts of inflammatory cells, whereas Subtype 5 tumors
exhibited high tumor infiltration by immune cell components in
approximately half of the cases (Fig. [203]3k). Likewise, in the
DEG[subtype] analysis, tumors of Subtypes 4 and 5 were more likely to
be accompanied by moderate-to-high stromal neutrophilic infiltration,
but the proportion of such tumors was not high (Fig. [204]3l).
Collectively, we obtained representative histological images of the
five subtypes that reflected their histopathological characteristics
and clinical relevance.
Proteogenomic features underlying whole-genome doubling in NSCLC subtypes
The proliferative subtype is common in NSCLC and is mainly
characterized by the upregulation of cell cycle pathways, including E2F
target, G2M checkpoint, and the MYC targets v1 and v2. Previous
proteogenomic study^[205]10 reported a proliferative subtype of LUAD,
the dominant proximal-proliferative cluster, which accounts for 27% of
patients with LUAD (30/110 samples; Supplementary Fig. [206]4a).
Similarly, a proliferative subtype, called the proliferative-primitive
cluster, affects 16% of patients with LSCC (17/108 samples,
Supplementary Fig. [207]4b). When considering the classical subtype,
which also exhibits upregulation of cell cycle pathways, proliferative
subtypes collectively constituted approximately 44% of patients with
LSCC^[208]13 (47/108 samples).
Our multiomics analysis identified Subtype 3 as a proliferative subtype
characterized by a WGD event. In Subtype 3, 75% of the patients (39/52)
exhibited WGD (Fig. [209]4a and Supplementary Data [210]4a, c).
Copy-number signature analysis revealed a significant association
between Subtype 3 and the CN14 signature (OR: 8.4, Fisher’s exact test;
Fig. [211]4b and Supplementary Fig. [212]4d), which is indicative of
high ploidy (absolute copy numbers 3 to 8), relatively large segment
size ( > 40 Mb), and whole-chromosome or chromosome arm-scale losses
prior to a single genome-doubling event. In particular, Subtype 3
showed amplification on chromosome 3q and significant enrichment for
the co-occurrence of SOX2 amplification and TP53 mutations compared to
the other subtypes (OR: 16.4, Fisher’s exact test; Fig. [213]4c, d and
Supplementary Data [214]4e, f). Furthermore, Subtype 3 in the combined
NMF analysis, which integrated the CPTAC LUAD and LSCC datasets also
showed a high frequency of WGD (64%, 41/64; Fig. [215]4a and
Supplementary Data [216]4b, d) and enrichment of the CN10 (28%, 18/64)
and CN16 signatures (17%, 11/64), which indicated focal and chromosomal
losses before single- and twice-genome doubling, respectively
(Supplementary Fig. [217]4c and [218]e). Subtype 3 exhibited a greater
TMB compared to that in the other subtypes in both our (P = 5.7 ×
10^−11, Wilcoxon rank-sum test) and previous studies (P = 6.9 × 10^−6,
Wilcoxon rank-sum test). Although LSCC is the most common histology
(67%, 35/52 in Korean NSCLC and 66%, 42/64 in CPTAC NSCLC), Subtype 3
is an NSCLC tumor characterized by the presence of WGD events due to
copy number alterations and cell cycle pathway enrichment.
Fig. 4. Proteogenomic features underlying whole-genome doubling (WGD) in
NSCLC subtypes.
[219]Fig. 4
[220]Open in a new tab
a Barplot showing WGD fraction in each multiomics subtype from our
study and CPTAC NSCLC patients. b Overlap of copy number signatures for
the five multiomics subtypes, with the colors indicating the odds ratio
from one-sided Fisher’s exact test. The COSMIC v3 signature and
etiology for each signature are indicated on the y-axis. *P < 0.05,
**P < 0.01, ***P < 0.001, ****P < 0.0001 (one-sided Fisher’s exact
test). c Top 10 most significantly enriched copy number variations
(CNVs) in Subtype 3 (FDR < 0.1). The y-axis indicates the cytoband and
the x-axis shows the log[10]-scaled FDR from linear regressions
comparing Subtype 3 and other samples. d Enriched co-mutations in each
subtype in both our and the CPTAC cohorts. EGFR mutation, missense
mutation, in-frame deletion, frameshift deletion, and amplifications
were counted. For TP53 and CDKN2A, amplifications were excluded, and
for SOX2, only amplifications were counted. *P < 0.05, **P < 0.01,
***P < 0.001, ****P < 0.0001 (one-sided Fisher’s exact test). e
Protein-level gene set enrichment analysis (GSEA) revealing upregulated
and downregulated pathways in Subtype 3 of both cohorts. The x- and
y-axis are enrichment scores (ES) from the current study and CPTAC
NSCLC data, respectively. Labeled pathways are the top-5 upregulated
pathways in Subtype 3. The Molecular Signatures Database (MSigDB)
hallmark gene set v7.4 was used for GSEA. f Subtype 3-specific kinase
activity scores. The sizes of the points indicate -log[10](FDR) from
kinase activity estimation. Significantly up- and downregulated kinases
are labeled (FDR < 0.05). g Elevated protein expression of XPO1 in
Subtype 3 is shown for our study (Subtype 1, n = 55; Subtype 2, n = 45;
Subtype 3, n = 52; Subtype 4, n = 43; Subtype 5, n = 34) and CPTAC LSCC
(Subtype 1, n = 11; Subtype 2, n = 19; Subtype 3, n = 42; Subtype 4,
n = 15; Subtype 5, n = 21). Kruskal-Wallis test was performed to test
the differences in expression. WGD status is marked by red dots and the
y-axis shows log[2] protein expression levels. For box-plots, middle
line, median; box edges, 25^th and 75^th percentiles; whiskers, most
extreme points that do not exceed ± 1.5 × IQR. h Sample information
(top), drug response curve (middle), and IC50 (bottom) for selinexor
(XPO1 inhibitor) for lung organoids highlighting a higher sensitivity
in WGD-positive LSCC organoids. Three technical replicates were tested
in each organoid sample. For IC50 barplot, dots indicate each
replicate, and error bars indicate average ± 1 standard deviation. i
WGD-related pathway underlying Subtype 1 LUAD tumors and Subtype 3 LSCC
tumors. Significantly upregulated kinases are highlighted with red
triangles (FDR < 0.05) and mutations are shown in purple boxes. Kinase
activity scores are estimated from phosphoprotein expression. The
log[2]fold changes from DE analyses are indicated by the color in each
box. For the phosphoproteome, only features with FDR less than 0.1 are
displayed.
Consistent with previous studies, Subtype 3 showed upregulation of
proteins, PTMs, and kinases in cell cycle pathways (Fig. [221]4e, f,
Supplementary Fig. [222]4f, g and Supplementary Data [223]3a,
[224]4g–j). The key proteins and PTMs involved in this subtype included
SRSF1 (proteins S199 and K179), SRSF2 (T25 and S26), and XPO1 (protein
K693). SRSF1 and SRSF2, both proto-oncogenes, modulate the splicing
patterns of tumor suppressor genes, kinases, and kinase receptors into
their oncogenic isoforms, thereby promoting cell cycle
activity^[225]34. XPO1 was upregulated at both the protein and
acetylated protein levels in this subtype (Fig. [226]4g).
Overexpression of XPO1 has been observed in many malignancies^[227]35
and its inhibition could potentially reduce cell proliferation and
promote cell cycle arrest in proliferative subtypes. To assess this
hypothesis, we selected tumor organoids from the organoid biobank of SG
Medical, Inc. (Seoul, Korea) generated from WGD-positive LUAD and LSCC
patients and WGD-negative controls. After treating the organoids with
selinexor, a targeted XPO1 inhibitor, we found higher drug sensitivity
in WGD-positive LSCC organoids than in other samples (Fig. [228]4h and
Supplementary Data [229]4k). This finding suggests that targeting XPO1
with selinexor may be a promising therapeutic approach for patients
with LSCC and WGD tumors, which warrants further investigation into its
potential clinical applications.
To a lesser extent, Subtype 1 also included patients with WGD (51%,
28/55 patients; Fig. [230]4a). Subtype 1 was significantly enriched in
the CN15 signature (OR: 4.8, Fisher’s exact test; Fig. [231]4b), evenly
distributed ploidy (absolute copy number from 2 to more than 9), and a
segment size of 1 to 10 Mb. This subtype was predominantly found in
patients with LUAD and was associated with co-occurring mutations in
EGFR and tumor suppressor genes (TP53 or CDKN2A) (OR: 5.3, Fisher’s
exact test; Fig. [232]4d), implying that subtype 1 is LUAD-prevalent
and WGD is activated through LUAD-specific tumor evolution events. In
the combined NMF analysis, we validated the proportion of WGD-positive
patients (46%, 19/41 patients; Fig. [233]4a) and the enrichment of the
WGD-related signature CN16 in Subtype 1 from previous LUAD and LSCC
studies (34%, 14/41 patients; Supplementary Fig. [234]4c).
WGD characterizes two major subtypes of NSCLC but seems to harbor
different driver genes and underlying pathways for each subtype
(Fig. [235]4i and Supplementary Data [236]4l). Subtype 1 was
characterized by a high rate of in-frame deletion and copy number gain
in EGFR, with a high mutational burden on tumor suppressor genes.
Significantly upregulated kinases in Subtype 1 included RPS6KA3, which
mediates cell growth signaling initiated by EGFR activation, CDK1,
CDK2, and CDK4, indicating an accelerated cell cycle (FDR < 0.1)
(Supplementary Fig. [237]4h and Supplementary Data [238]4m).
Additionally, CDK12 and PIK3C3, which are related to DNA repair and
autophagy, respectively, were significantly upregulated (FDR < 0.1).
Since increased cell proliferation induces hypoxia and nutrient
depletion^[239]36, these kinases may be activated in response to WGD to
promote the cell’s adaptation to the nutrient-deprived environment and
stabilize the genome, ultimately leading to cell survival. Similarly,
together with CDK1, CDK2, and CDK4, kinases related to the DNA damage
and nutrient deprivation responses were activated in Subtype 3 LSCC
samples (Supplementary Fig. [240]4i and Supplementary Data [241]4m).
However, the putative driver mutations of WGD in Subtype 3 are TP53
mutations and copy number amplification on chromosome 3q, where many
cell cycle genes reside, including SOX2, ATR, STAG1, GSK3B, TFDP2, and
MCM2. Since LUSC tumors tend to have copy number gain in the 3q
arm^[242]37, Subtype 3 may have benefited from a 3q gain after WGD,
thus having a higher WGD fraction than Subtype 1 samples. In addition,
unlike Subtype 1, CHEK1 and EEF2K play important roles after undergoing
WGD. EEF2K is known to inactivate EEF2 by phosphorylation under dietary
restriction, which in turn contributes to cell survival^[243]38.
Overall, these results show that Subtypes 1 and 3 represent
proliferative subtypes that have undergone WGD in LUAD and LSCC tumors,
respectively, and that selinexor may be an effective treatment for
patients with Subtype 3 will requires further validation study.
Heterogeneous immune landscapes in NSCLC
Understanding the tumor immune microenvironment (TIME) is crucial in
the molecular characterization of cancer subtypes^[244]39,[245]40. To
profile the TIME in patients with NSCLC, two clustering analyses based
on cell type^[246]41 and pathways^[247]42 were performed, with
enrichment scores inferred from curated gene signatures. We identified
three major immune clusters, namely, hot-tumor-enriched (HTE),
cold-tumor-enriched (CTE), and NAT-enriched, across 205 tumors and 85
NATs (Fig. [248]5a and Supplementary Data [249]5a). HTE and CTE immune
clusters overlapped considerably between the cell type- and
pathway-based clusters (Supplementary Fig. [250]5a and Supplementary
Data [251]5b). HTE tumors were mainly enriched in CD8+ and CD4 + T
cells, regulatory T (Treg) cells, B cells, natural killer (NK) cells,
neutrophils, dendritic cells (DCs), monocytes, and macrophages at the
cell-type level, compared to CTE tumors. In the pathway-based
clustering results, the HTE cluster showed a greater activation of
immune-related pathways than the CTE cluster. Because the cell-type
enrichment score was inferred based on curated gene signatures, we
confirmed the cell-type specificity of the immune clusters using
scRNA-seq data from multiple NSCLC studies^[252]23. From the top-300
DEGs in the HTE (DEG[HTE]) and CTE (DEG[CTE]) clusters, we investigated
the enrichment of each DEG in scRNA-seq data. The DEG[HTE] set was
enriched in CD8+ and CD4 + T cells, regulatory T cells, B cells, NK
cells, neutrophils, DCs, monocytes, and macrophages, whereas the
DEG[CTE] set was primarily enriched in epithelial cells, including LUAD
and LSCC tumor cells (Fig. [253]5b, top, and Supplementary
Data [254]5c). Additionally, the DEGs of each cell type inferred from
the gene signatures were enriched in the corresponding cell types in
the scRNA-seq data (Supplementary Fig. [255]5b). The cell-type
enrichment score was also highly correlated with the level of TILs
(Fig. [256]5b, bottom, and Supplementary Data [257]5d). These results
indicate that the single-cell and immunohistochemical analyses achieved
good immune clustering based on the cell type score.
Fig. 5. Landscapes of immune clusters and cell types across NSCLC subtypes
and cohorts.
[258]Fig. 5
[259]Open in a new tab
a Immune subtyping based on cell type and pathway enrichment scores.
Cell type-based clustering was performed with 205 tumor and 85 normal
adjacent to the tumor (NAT) samples, and pathway-based clustering was
performed using only tumor samples. The tumor-infiltrating lymphocyte
(TIL) pattern, clinical histology (diagnostics, DX), multiomics
subtype, tumor stage, and tissue information are described. IC, immune
cluster. b (top) DEG[HTE] and DEG[CTE] were used to generate the UMAP
plot of scRNA-seq data. A two-sided t-test was conducted to assess the
statistical significance of the differences in gene expression. The
color of each point represents the module score of each cell; higher
scores are shown in red. UMAP information was obtained from multiple
NSCLC studies^[260]23. (bottom) The correlations of 10 cell types and
immune clusters with the pattern of TILs were analyzed. The sizes and
colors of the circles indicate the statistical significance and
correlation coefficient of the correlations, respectively. The
horizontal black dotted line indicates P = 0.05. c Hazard ratios for
overall survival (OS, left) and relapse-free survival (RFS, right)
related to various cell types in the cell type-based immune cluster. A
hazard ratio lower than zero (blue box) indicates that the
hot-tumor-enriched (HTE) status or a high cell type score was
associated with prolonged survival. Error bars (gray lines) represent
mean ± 95% confidence interval (CI). Red text indicates statistical
significance in the survival analysis by the log-rank Mantel‒Cox test
(n = 174). d Correlations of the RNA expression, protein expression,
and protein activity of 10 immunomodulators with immune cluster status
for our cohort as well as other lung cancer multiomics
cohorts^[261]10,[262]13. Correlation coefficients and p-values were
obtained from a generalized linear model (GLM). e Correlations between
the expression or activity of immunomodulators and the status of driver
mutations in our cohort and the Satpathy and Gillette cohorts. The top
associations between immunomodulators and known driver genes are
described. f The left boxplot shows the RNA and protein expression of
SLAMF7 in samples (n = 205) with wild-type or mutant SMARCA4 (n = 205);
the right boxplot shows the RNA and protein expression of SLAMF7 in HTE
and cold-tumor-enriched (CTE) samples (n = 174). The two-sided t-test
was performed to test the differences in expression. The box represents
the 25th and 75th percentiles, the central mark denotes the median, and
the whiskers extend to the most extreme points within ±1.5 × IQR. g
(left) Box (top) and balloon (bottom) plots showing the mean expression
of marker genes of HTE and 10 cell types across the multiomics subtypes
(n = 174). The marker genes were defined as the top−300 and −30 most
overexpressed genes in HTE samples and highly cell type-enriched
samples, respectively. (right) The bar (top) and box (middle and
bottom) plots show the mutation frequency of SMARCA4 and RNA/protein
expression of SLAMF7 across multiomics subtypes (n = 174),
respectively. The Kruskal‒Wallis test was performed to assess the
differences among the multiomics subtypes. The box represents the 25th
and 75th percentiles, the central mark denotes the median, and the
whiskers extend to the most extreme points within ±1.5 × IQR.
To analyze the prognostic value of the immune landscape in patients
with NSCLC, each immune cell enrichment score, along with the immune
cluster, was tested against overall and relapse-free patient survival
rates (Fig. [263]5c, Supplementary Fig. [264]5c and Supplementary
Data [265]5e). HTE status was the most favorable factor for patient
survival, and high enrichment of CD8+ and CD4 + T cells, B cells, and
NK cells were positively correlated with a good prognosis, in contrast
to the enrichment of epithelial cells or neutrophils, which had
negative correlations. None of the immune-related pathway-based
enrichment scores were significantly associated with patient survival,
although most showed a positive association with prolonged survival
(Supplementary Fig. [266]5d). HTE tumors with high enrichment of CD8+
and CD4 + T cells and immune-related pathways are known to be
associated with favorable prognoses^[267]43. Additionally, Previous
studies^[268]10,[269]13 reported that Tregs were enriched in HTE
tumors, but had an immunosuppressive effect and were associated with
poor prognosis. We found a concordant trend of Treg enrichment in HTE
tumors (Fig. [270]5a, b); patients with Treg-enriched HTE had worse
survival rates than those with low Treg levels (Supplementary
Fig. [271]5e and Supplementary Data [272]5f).
Further analysis of the immune clusters was performed based on their
cell-type composition, which showed significant correlations with
patient survival. We performed a pathway enrichment analysis comparing
the HTE and CTE immune clusters using multiomics features
(Supplementary Fig. [273]5f and Supplementary Data [274]5g). A set of
immune-related and EMT signaling pathways was enriched in HTE tumors,
whereas cell cycle-related pathways and glycolysis were enriched in CTE
tumors. Specifically, some metabolic pathways and oxidative
phosphorylation were enriched in CTE tumors at the phosphoprotein level
but were enriched in HTE tumors at the acetylated-protein level. Many
of these results are consistent with the findings of previous lung
cancer multiomics studies^[275]10,[276]13.
The differences between HTE and CTE immune clusters may be better
elucidated by examining their respective regulators. To identify
putative regulators of patient immune landscapes, we analyzed RNA and
protein expression data and inferred protein regulatory activity
according to the systematical influence of a protein in the
transcription of relevant targets^[277]44. We performed a regression
analysis of gene expression or activity and enrichment scores for cell
type or immune clusters. We found that the majority of
immunomodulators, which mainly comprised cancer cell ligands and immune
cell receptors, were positively correlated with HTE tumors and most
immune cells, but not epithelial cells^[278]45 (Supplementary
Fig. [279]5g and Supplementary Data [280]5h). Notably, the expression
and activity of VEGF-A were negatively correlated with HTE tumors,
which contrasts the patterns observed with other immunomodulators.
VEGF-A was negatively associated with TILs and has potential
implications for cancer risk^[281]46,[282]47. Specifically, 10
immunomodulators showed significant correlations with immune cluster
status, with a consistent correlation direction not only in our cohort,
but also in other lung cancer multiomics studies^[283]10,[284]13
(Fig. [285]5d and Supplementary Data [286]5i).
We profiled the associations between the immunomodulators and known
driver mutations to determine the putative mechanism(s) underlying
changes in immunomodulators. Because Subtype 5 is immunogenic, we only
considered the most activated or repressed immunomodulators to identify
the potential regulatory mechanisms that distinguish it from the other
subtypes. Ultimately, the three immunomodulators showed significant
correlations with mutations in one of the eight driver genes, with
consistent patterns observed in other multiomics NSCLC
cohorts^[287]10,[288]13, although these correlations were not
statistically significant (Fig. [289]5e and Supplementary
Data [290]5j). Among these immunomodulators, the SMARCA4 mutation was
positively correlated with the expression of SLAMF7 at both the RNA and
protein levels, and positively correlated with HTE status in both our
study cohort and the independent cohort^[291]10 (Fig. [292]5f,
Supplementary Fig. [293]5h, and Supplementary Data [294]5k). Based on
these results, SMARCA4 is a potential regulator of the immunomodulator
SLAMF7, which is associated with HTE status.
We further analyzed the distribution of the immune landscape among the
multiomics subtypes (Fig. [295]5g, left, and Supplementary
Data [296]5l). The expression of HTE feature genes (DEG[HTE]) was
highest in Subtype 5, which had immunogenic characteristics. This
subtype also showed the highest activation of CD8+ and CD4 + T cells,
Treg cells, B cells, NK cells, neutrophils, and macrophage marker
genes. Neutrophils and Tregs, which are associated with a poor
prognosis in all patients and specifically in THE patients,
respectively (Fig. [297]5c and Supplementary Fig. [298]5e), were also
found to be enriched in Subtype 5. This enrichment may be linked to the
intermediate survival characteristics observed in this subtype, despite
Subtype 5 exhibiting the strongest THE tumor signal. Moreover, the
highest frequencies of SMARCA4 mutations and SLAMF7 overexpression were
observed in Subtype 5 (Fig. [299]5g, right). We confirmed that the
distribution of cell types within the immune cluster and their
correlation with the multiomics subtype were consistent in the
integrated cohort combining our cohort with those of recent multiomics
studies^[300]10,[301]13 (Supplementary Fig. [302]5i and Supplementary
Data [303]5m). Our results showed that HTE tumors were positively
associated with better prognosis in patients with NSCLC. These tumors
were enriched in immunogenic Subtype 5, which was also positively
associated with mutations in the putative regulator SMARCA4, which
targets SLAMF7, an immunomodulator.
Multiomics profiling of neoantigens and immune clusters
Neoantigens are tumor-specific antigens generated by tumor cells and
are key factors affecting the immune landscape of cancer
patients^[304]48. Therefore, we predicted neoantigens in our NSCLC
cohort using a multiomics dataset. Neoantigens derived from somatic
mutations in the coding regions were predicted using WES data and
confirmed by MS analysis to determine whether they were expressed as
peptides of the “confirmed neoantigen candidates.” Furthermore, we
identified peptides derived from non-coding and non-annotated
transcripts that were distinct from the canonical neoantigens. Among
them, we selected “cryptic MHC class I-associated peptides (MAPs),”
which were described as noncanonical peptides predicted to bind to MHC
class I molecules in a previous study^[305]49–[306]51. We identified
“confirmed cryptic MAPs” based on a previously determined expression
threshold^[307]52. We inferred 85,430 neoantigen candidates and 775
cryptic MAPs (Supplementary Fig. [308]6a) and annotated the origin of
the cryptic MAPs based on the matched transcripts (Supplementary
Fig. [309]6b and Supplementary Data [310]6a). Non-annotated isoforms,
pseudogenes, and untranslated regions (UTRs) accounted for 90.97% of
the sources of cryptic MAPs, which is consistent with previous
studies^[311]53–[312]58.
We tested the associations of the neoantigen candidates, confirmed
neoantigen candidates, cryptic MAPs, and confirmed cryptic MAPs with
patient survival (Fig. [313]6a and Supplementary Data [314]6b).
Interestingly, only confirmed cryptic MAPs showed a strong positive
correlation with improved survival, although the number of cryptic MAPs
was notably low across patients (Supplementary Fig. [315]6c and
Supplementary Data [316]6b). We also found 12 confirmed cryptic MAPs in
more than three patients, called them recurrent cryptic MAPs, some of
which were derived from the same gene of origin (Supplementary
Fig. [317]6d and Supplementary Data [318]6c). Furthermore, the presence
of recurrent cryptic MAPs was significantly correlated with prolonged
survival (Fig. [319]6b and Supplementary Data [320]6b), indicating the
prognostic value of cryptic MAPs in patients with NSCLC.
Fig. 6. Clinical relevance of neoantigens and cryptic MAPs, and their
associations with multiomics subtypes.
[321]Fig. 6
[322]Open in a new tab
a Survival estimated according to the type of neoantigen and cryptic
MAP. A lower hazard ratio (blue box) indicates that a high load of
neoantigens or cryptic MAPs is associated with prolonged overall
survival, and a high hazard ratio (red box) indicates the opposite.
Hazard ratios for individual trials and overall effects are given with
95% CIs. Log (HR) values and their corresponding 95% confidence
intervals (CIs) are depicted in grey. b Kaplan‒Meier curve showing the
survival of two groups of patients (n = 204) according to whether they
did (blue line) or did not have (red line) recurrent cryptic MAPs. The
p-value was derived by comparing the curves with the log-rank
Mantel‒Cox test. c Correlations between the number of cryptic MAPs and
enrichment scores of 10 cell types and the immune cluster. Correlation
coefficients were calculated by a linear regression model with the
covariates of sample batches and histological diagnosis. The size of
the dots indicates the degree of the -log[10]-scaled p-value, and the
color of the dots represents the strength of the correlation
coefficient. The bold-lined dots indicate statistical significance. d
Kaplan‒Meier curve showing the survival patterns of four groups of
patients (n = 174) stratified by cryptic MAP load and immune cluster.
The p-value was obtained by comparing curves between the two groups
with the largest difference in the log-rank Mantel‒Cox test. e
Enrichment analysis of the four groups described in Fig. [323]4d for
the multiomics subtypes. The x- and y-axis indicate enrichment and
statistical significance calculated using a two-sided Fisher’s exact
test with the Benjamini‒Hochberg adjustment, respectively. The size of
each dot indicates the level of significance. f Features of patients
with multiomics Subtype 5 disease who had a low cryptic MAP load with
an HTE status, activated APM, and activated NF-κB pathway. g
Kaplan–Meier curve of recurrence-free survival according to treatment
status (chemotherapy [CTx] or chemoradiation therapy [CRTx]) in
patients categorized by multiomics subtype. The p-value was derived by
comparing the curves with the log-rank Mantel‒Cox test.
We also investigated the association between cryptic MAPs, immune
clusters, and cell-type enrichment. Cryptic MAPs were positively
correlated with most immune cell types, including HTE status, although
some correlations were weak or insignificant (Fig. [324]6c and
Supplementary Data [325]6b). In our cohort, patient prognosis was
evaluated after stratification according to immune cluster criteria and
cryptic MAP load. Patients with high cryptic MAP load and HTE status
showed the longest survival, whereas those with low cryptic MAP load
and CTE status showed the worst survival (Fig. [326]6d and
Supplementary Data [327]6b).
We determined the association between the integrated immune landscape
and multiomics subtypes by evaluating the distributions of the
multiomics subtypes based on the combined status of the cryptic MAP
load and immune clusters. Multiomics Subtype 5 was most enriched in the
low cryptic MAP load and HTE groups (Fig. [328]6e and Supplementary
Data [329]6d). Interestingly, a similar survival trend and multiomics
subtype distribution were obtained when the immune cluster was replaced
with antigen processing, presentation machinery (APM), and TIL pattern
(Supplementary Fig. [330]6e–g and Supplementary Data [331]6e, f), which
was proposed as a critical factor for evaluating the immune-activating
potential with neoantigens^[332]14. Notably, patients with Subtype 5,
who showed moderate survival, had a low cryptic MAP load with HTE and
activated APM. This is consistent with the large proportion of TILs
observed in single-cell and histopathological analyses
(Fig. [333]3a–k). Patients with Subtype 5 also showed high activity of
the TNF-α-pathway via NF-κB in addition to APM. Consequently, the
moderate survival associated with this subtype appeared to be linked to
a low load of cryptic MAPs and high enrichment of immunosuppressive
Tregs and neutrophils despite having an active immune landscape with
HTE status (Fig. [334]6f and Supplementary Data [335]1a). Remarkably,
the use of adjuvant chemotherapy or chemoradiation therapy
substantially enhanced the survival of patients with Subtype 5 tumors
(Fig. [336]6g). In contrast, no substantial improvement in survival was
observed in patients with other subtypes who underwent adjuvant
chemotherapy or chemoradiation therapy. Subtype 5 also demonstrated the
most favorable prognosis compared to other subtypes in the overall
population treated with adjuvant chemotherapy, despite the marginal
statistical significance (data not shown). This underscores that
subtype 5 could be associated with a clinical benefit from adjuvant
chemotherapy. These results imply that understanding patient prognosis
and multiomics subtypes requires a multifactorial consideration of the
TIME. Furthermore, the distribution of HLA alleles before and after the
binding prediction was consistent with previous reports on the Korean
population or lung cancer studies^[337]59–[338]62, thereby
corroborating the reliability of our results (Supplementary
Fig. [339]6h–j).
Discussion
In this study, we conducted a comprehensive multiomics analysis of 229
patients from a Korean NSCLC cohort and identified five NSCLC subtypes
enriched for WGD, oncogenes, metastasis, and the immune
microenvironment. The phosphoproteome dataset was the most informative
for subtype identification, contributing 80% (911 out of 1,134) of the
features, while global proteome and acetylome data also played crucial
roles in decoding signaling pathways across the identified subtypes. By
extending our analysis to integrate multiomics data from 462 patients
with NSCLC, we validated our subtype classification, confirming its
alignment with previously established multiomics subtypes. Utilizing an
extensive single-institute clinical dataset allowed us to delineate
detailed histopathological beyond histologic subtype and clinical
relevance of our multiomics subtypes, such as prevalent alterations of
targetable oncogenic drivers, histologic grade of LUAD, metastatic
potential, and tumor immune response, and finally complement previous
findings on the underlying biology of NSCLC.
We identified a NSCLC subtype, Subtype 4, which was associated with a
high frequency of metastasis and poor outcomes, independent of the
NSCLC histological type. This subtype contained a large proportion of
desmoplastic stromal components, suggesting that predominant tumor
invasion and metastasis contribute to its aggressive behavior.
Increased phosphorylation in hypoxia and activation of the PI3K-Akt
signaling pathway were key characteristics of this subtype. In
particular, the upregulation of SLK and LRRFIP1 phosphorylation
predicted poor survival outcomes and has been consistently observed in
other NSCLC multiomics studies. Further research on these events may
identify targets for therapeutic intervention and enhance our
understanding of poor outcomes in NSCLC.
Characterizing NSCLC according to multiomics subtypes has important
implications for personalized treatment strategies. Our study revealed
that Subtype 3, characterized by high levels of chromosomal instability
and XPO1 expression, represents a highly proliferative NSCLC subtype
with co-occurring TP53 and cell cycle gene amplification.
Patient-derived organoid experiments indicated that selinexor, a
targeted inhibitor, could be effective against Subtype 3 with WGD. In
Subtype 1, WGD only occurred in a limited number of patients with LUAD
(51%, 28/55 patients) and was enriched for co-occurring mutations in
EGFR and tumor suppressor genes. However, XPO1 expression was not
remarkably increased and selinexor did not show strong efficacy in this
subtype, suggesting potential subtype-specific differences in XPO1
expression and therapeutic response to XPO1 inhibition. These findings
suggest that WGD events are associated with tumor evolution via genomic
alterations specific to the subtype, even in WGD-predominant
subtype^[340]7,[341]63.
We demonstrated that Subtype 5, enriched in the inflammatory tumor
microenvironment, exhibited extraordinary improvements in survival with
conventional adjuvant chemotherapy. This finding implies that the
selection of postoperative adjuvant treatment for NSCLC could be
helpful only for the case with high tumor immune response.
Additionally, patients within this subtype could potentially benefit
from immune checkpoint inhibitor therapy. Identifying suitable patients
for coventional adjuvant chemotherapy could be facilitated by various
biomarkers indicating a positive tumor immune response, including PD-L1
immunostaining, T-cell marker presence, and the histological nature of
the tumor immune microenvironment. In contrast, adjuvant chemotherapy
did not significantly alter the survival of node-positive patients with
Subtypes 1 or 2, which are enriched in driver oncogene alterations.
Given the minimal survival improvement with cytotoxic chemotherapy in
EGFR-mutant LUAD observed in prior trials^[342]64,[343]65, targeted
therapies aimed at driver mutations may offer better outcomes for these
patients.
Multiomics-based profiling has the potential to enhance our
understanding of the TIME. The HTE signature exhibited significant
activation in Subtype 2, but was repressed in Subtype 4, which is
inconsistent with our histopathological observations. This highlights
the value of multiomics profiling as a complementary tool to
histopathological examination in the functional characterization of the
tumor microenvironment. We identified SLAMF7 as an immunomodulator
significantly associated with HTE and SMARCA4 mutations in Subtype 5.
Given the efficacy of immune checkpoint inhibitors in NSCLC patients
with SMARCA4 mutations^[344]66, SLAMF7, and SMARCA4 are promising
molecular targets for enhancing cancer immunotherapy. Our multiomics
analyses also revealed cryptic MAPs derived from non-coding and
non-annotated transcripts, which were subsequently confirmed using
proteomics. Unlike neoantigen candidates, cryptic MAPs were
significantly correlated with favorable survival and immune cell
enrichment. Thus, experimental validation of MHC-binding prediction
using immunopeptidomics for recurrent cryptic MAPs could facilitate the
development of a cancer vaccine.
Our study provides a comprehensive profile of the multiomics subtypes
of NSCLC, but there are several limitations to consider. First, we used
exome sequencing to infer whole-genome doubling, which may be less
accurate than whole-genome sequencing. Ethnic differences may also be
present in certain subtypes, particularly in Subtype 1 for EGFR female
patients, although some subtypes were consistently found in both our
cohort and the CPTAC cohorts. Additionally, the confirmation of cryptic
MAPs by proteomics is limited owing to the experimental availability of
normal samples. Further studies using larger normal sample cohorts are
needed to confirm these findings. A prospective cohort study is also
needed to test the effectiveness of drugs, such as selinexor, in the
various subtypes. Finally, more comprehensive studies are needed to
understand the effectiveness of selinexor in treating the WGD subtype.
Methods
Human subjects with investigation of clinicopathologic features
A total of 229 samples, self-reported as Korean ethnicity, were
histologically defined as NSCLC and selected for this study. Tumors and
normal tissues adjacent to the tumor (NAT) were harvested under the
Clinical Proteomic Tumor Analysis Consortium (CPTAC) guidelines from
Asan Medical Center, Korea, with the approval of the Institutional
Review Board of Asan Medical Center (Approval number: 2019-1210). In
brief, we retrieved the 408 NSCLC cases whose cold ischemic time of
fresh frozen tumor and NAT tissue sample was less than 15 minutes from
the NSCLC cases of the bioresource center of Asan Medical Center
deposited from January 2010 to March 2019. To investigate the impact of
proteogenomic findings on post-operative therapy and metastasis, we
preferentially selected 137 NSCLC patients showing lymph node
metastasis at the pathologic examination of resection specimens. We
further included 113 cases with absent lymph node metastasis at the
time of surgical resection. With the further exclusion of 21 cases
showing inadequate nucleic acid quality metrics, 229 patients were
finally enrolled in the study. The clinical information of the enrolled
patients, including age, gender, smoking history, adjuvant therapy,
presence of post-operative recurrence, and treatment histories, was
reviewed and documented by the thoracic oncologists (W.J.J, C.M.C., and
J.C.L.). All the histopathologic slides of the resection specimens
removed from the enrolled patients were independently reviewed by two
thoracic pathologists (H.S.H. and S.J.J.) and documented for pathologic
findings such as pathologic diagnosis, grade, tumor size, pleural
invasion, histologic pattern (for non-mucinous adenocarcinoma,
lymphovascular invasion, spreading through alveolar space, and lymph
node metastasis. The cases with discrepancy between the independent
assessments were discussed in consensus meetings. Based on these
findings, all enrolled patients were re-staged according to the 8th
lung cancer TNM stage^[345]67. In addition, pattern of
tumor-infiltrating lymphocytes (TIL) was reviewed by the pathologists
and classified according to the classification by Saltz et al.^[346]68.
Whole exome sequencing
Genomic DNA was isolated from a FFPE (formalin-fixed paraffin embedded)
tumor and normal samples (NATs or blood buffy coat^[347]69 if NATs are
not available). We generated whole exome sequencing (WES) libraries
using SureSelect V6-Post (Agilent, CA, USA). Pooled libraries were run
on the Illumina NovaSeq to obtain an average of 300x depth per tumor
library and 100x depth per NAT and blood buffy coat library. The raw
Illumina BCL (base calls) binary files were converted into Fastq files
using the Illumina package bcl2fastq. Sequencing read data was checked
for quality and adaptor/overrepresented sequence (ORS) by FastQC
(v0.11.9). There was no ORS that could be a quality problem, and since
the adapter sequence was less than 1% in all samples, the additional
trimming process was skipped. Fastq files were aligned to the human
reference GENCODE GRCh38.p13 v32 primary assembly genome by BWA
(v0.7.17). Among the three algorithms of BWA, the latest BWA-MEM was
selected as it is suitable for Illumina sequences ranging from 70 bp to
1 mbp with fast and accurate performance. After alignment, we sorted
the BWA process output by coordinates via Picard SortSam and checked
duplicate reads using Picard MarkDuplicates (v2.23.1). Two types of
duplicated reads, optical duplicates (incorrectly detected as multiple
clusters by the optical sensor of the sequencing machine) and library
duplicates (created during PCR library preparation), were annotated for
downstream analyses. For all tools requiring interval files, we used
V6_S07604514_hs_hg38_S07604514_Covered.bed provided by Agilent. Because
of systematic errors from the sequencing machine, base quality scores
were recalibrated by the GATK4 BQSR algorithm (v4.1.8.0). The algorithm
calculates the average Phred score by assuming the mismatch base is an
error by matching known variants with the data base. We used the known
variants VCF from dbSNP
(broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf,
broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf.gz) and the
Mills and 1000 G project
(broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz) from
the GATK4 bucket
([348]https://console.cloud.google.com/storage/browser/genomics-public-
data/resources/broad/hg38/v0/). Then, the empirical Phred score is
calculated using the average Phred score and then recalibration is
performed based on the empirical Phred score.
Variant calling
We called somatic variants from the recalibrated BAMs using GATK4
Mutect2 (v4.1.8.0). In the Mutect2 pipeline,
somatic-hg38_1000g_pon.hg38.vcf.gz data was used as panel-of-normals
and somatic-hg38_af-only-gnomad.hg38.vcf.gz data was used for germline
variants removal. We performed tumor with paired normal mode to exclude
germline variants. As one tumor sample had no matching normal sample,
the tumor only mode was performed. After calling somatic variants from
GATK4 Mutect2, a read support reference for the well-known variant
sites was created from the tumor recalibrated BAMs using GATK4
GetPileupSummaries. The tool requires a common somatic mutation sites
VCF, so we used the somatic-hg38_small_exac_common_3.hg38.vcf.gz file
created from the gnomAD resource from the GATK4 bucket
([349]https://console.cloud.google.com/storage/browser/genomics-public-
data/resources/broad/hg38/v0/). The GetPileupSummaries table was used
in the GATK4 CalculateContamination to calculate the fraction of reads
deriving from cross-sample contamination. The calculated cross-sample
contamination data were saved as a table for each sample, and tumor
segmentation data by minor allele fraction were additionally collected.
The raw output of Mutect2 was converted into filtered VCF using
cross-sample contamination table and tumor segmentation data by GATK4
FilterMutectCalls. The VCFs generated from GATK4 FilterMutectCalls were
converted to annotated VCF and MAF files via GATK4 funcotator. We used
GATK4 4.2.0.0 version solely for the annotation step to use the latest
annotation data source for GATK4 funcotator. The latest pre-packaged
data sources (funcotator_dataSources.v1.7.20200521 s) was downloaded
from the GATK4 bucket (gs://broad-public-datasets/funcotator/). All
annotations were performed in CANONICAL transcript-selection-mode after
choosing a custom transcript list. Finally, the annotated MAF files
were merged into one using maftools v2.8.05^[350]70.
Germline variants were called from the recalibrated BAMs using GATK4
HaplotypeCaller (v4.1.8.0). The reference confidence scores were
confirmed in GVCF mode using the option -ERC GVCF, which was a
reference model emitted with condensed non-variant blocks. For variant
calls, -G StandardAnnotation and -G AS_StandardAnnotation annotation
options were applied. All single sample GVCFs from GATK4
HaplotypeCaller were imported into GenomicsDB by GATK4 GenomicsDBImport
(v4.1.8.0) for joint genotyping. Joint genotyping was performed using
GATK4 GenotypeGVCFs (v4.1.8.0) from the constructed GenomicsDB. The
produced joint VCF was recalibrated in variant quality by two steps.
The first step was the recalibration of the SNPs. After calculating the
exome specific recalibration score through GATK4 VariantRecalibrator
(v.4.1.8.0), it was applied to the joint VCF using GATK4 ApplyVQSR
(v.4.1.8.0). We used the training sets from the HapMap project
(broad_hg38_v0_hapmap_3.3.hg38.vcf.gz), Mills and 1000 G project
(broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz,
broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz) and dbSNP
(broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf) from GATK4 bucket
([351]https://console.cloud.google.com/storage/browser/genomics-public-
data/resources/broad/hg38/v0/). The second step was the recalibration
of indels. The same tools in the first step were used and the training
sets from Mills and 1000 Genome Project
(broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz) and
the dbSNP genotyping calls
(broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf). The recalibrated
joint VCF was converted to annotated VCF via the GATK4 funcotator. The
4.2.0.0 version of GATK4 was again used at the annotation step to
utilize the latest annotation data source for GATK4 funcotator. The
latest pre-packaged data sources
(funcotator_dataSources.v1.7.20200521 g) was downloaded from the GATK4
bucket (gs://broad-public-datasets/funcotator/). All annotations were
performed in CANONICAL transcript-selection-mode.
Identification of copy number alterations
DNA somatic copy number variations (CNVs) were detected using CNVkit
v0.9.8^[352]71. We labeled the copy number status with the following
criteria: genes with an absolute copy number greater than or equal to 4
were labeled as “amplification”, and genes greater than or equal to 2.5
and less than 4 were labeled as “gain”. Similarly, genes with an
absolute copy number from 0.5 to 1.5 were labeled “heterozygous
deletion”, and those less than 0.5 were labeled “homozygous deletion”.
Copy number signature analysis and WGD detection
For copy number signature analysis and WGD detection, we used FACETS
v0.16.0^[353]72 to identify allele-specific copy number information.
Preprocessed paired tumor-normal BAM files and a VCF file of common and
germline polymorphic sites downloaded from
[354]https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/
were used as the input for FACETS. Two samples were dropped out due to
quality problems; therefore, 228 samples in total were used for copy
number signature analysis and WGD detection. We defined samples as
‘WGD-positive’ if greater than 50% of their autosomal genome had an
absolute copy number greater than or equal to two.
Mutational signature analysis was performed using the COSMIC signature
database v3^[355]73 and R package Sigminer v2.1.5^[356]74, as
previously described^[357]75. Briefly, three copy number features,
including total copy number, segment length, and loss-of-heterozygosity
state were extracted and classified into 48 components that categorize
the continuous values regarding the value range and biological
significance. To decide the number of signature groups (or a
factorization rank), non-negative matrix factorization (NMF) was
performed using a tumor-by-component matrix with 50 runs checking 2 to
12 ranks. Based on the cophenetic score plot, we determined to use rank
seven for copy number variants. Each signature was nominated by the
COSMIC signature of the highest cosine similarity. Then, by performing
hierarchical clustering, we assigned the samples to one of the
signatures based on the consensus matrix.
RNA-seq data generation and preprocessing
Total mRNA-seq libraries were prepared using Library-TruSeq Stranded
Total RNA with Ribo-Zero H/M/R_Gold. Pooled libraries were run on the
Illumina NovaSeq6000 to generate an average of 200 million reads per
library with 100-bp pair-end. The raw Illumina BCL binary files were
converted into Fastq files using the Illumina package bcl2fastq. The
sequenced read data was checked for quality and adaptor/overrepresented
sequence (ORS) by FastQC (v0.11.9). There was no ORS that could be a
quality problem, and since the adapter sequence was less than 1% in all
samples, the additional trimming process was skipped. Fastq files were
aligned to the human reference GENCODE GRCh38.p13 v32 primary assembly
genome using STAR (v2.7.3a). After alignment, we sorted the output by
coordinates via Picard SortSam and checked duplicate reads using Picard
MarkDuplicates (v2.23.1). The reads that contain Ns in the cigar string
were split using GATK4 SplitNCigarReads. Regarding systematic errors
from the sequencing machine, base quality scores were recalibrated by
the GATK4 BQSR algorithm (v4.1.8.0). The aligned BAMs from STAR were
sorted by coordinates using Samtools Sort (v1.10). The Salmon algorithm
(v1.4.0) was applied to quantify transcript-level expression, and raw
read counts were produced. The output files (quant.sf) produced by
Salmon, quantified transcript-level estimates, were imported to R using
tximport and converted to a gene-level expression matrix. Then, we
estimated the size factors using the “median ratio method” with the
“estimateSizeFactors” function supported by DESeq2 R packages. With
estimated size factors, we normalized the gene-level read count matrix
using the “counts” function in DESeq2 R packages^[358]76, which divides
the read counts by the size factors.
Isoform expression analysis
Isoform scale expression was quantified from transcriptome data using a
pipeline including StringTie^[359]77. STAR-aligned BAM without going
through GATK4 SplitNCigarReads in RNA preprocessing was used.
Transcripts of each sample were assembled using StringTie
v2.1.7^[360]78. The first analyzed outputs were combined into a single
assembly using merge command of StringTie (stringtie –merge), creating
a single merged gtf containing the transcript of the entire sample.
After that, a secondary analysis was conducted to calculate the
expression in isoform level of each sample based on the merged
transcript assembly using the StringTie Ballgown command (stringtie
-eB). The novel isoform in the StringTie outputs were compared and
annotated using gffcompare v0.11.6^[361]79. Processed isoform gtfs are
made into a single count matrix using predDE.py, a python script
provided by StringTie.
The abundance or exon usage change between normal and cancer tissue was
analyzed and visualized using IsoformSwitchAnalyzeR pipeline.
IsoformSwitchAnalyzeR v1.17.4^[362]80 pipeline includes DRIMseq^[363]81
to find isoform switch by condition and extract the sequence of mRNA or
amino acid produced from the isoform. We analyzed predicted isoforms or
ORF lists with additional tools as CPAT^[364]82, PFAM^[365]83,[366]84,
SignalP^[367]85 and IUPred2A^[368]86.
Fusion gene analysis
To detect the fusion genes, RNA-seq fastq was aligned, and fusion
calling and filtering were performed. First, RNA-seq reads were mapped
to the GENCODE GRCh38.p13 v32 primary assembly genome using STAR
aligner v2.7.3a^[369]87. Unlike previous STAR mapping, ‘–chimOutType
WithinBAM’ was included in the STAR output to include chimeric read to
be suitable for use in arriba fusion calling. To increase sensitivity,
the parameters were adjusted as recommended by the author as follows:
* --chimSegmentMin 10
* --chimOutType WithinBAM SoftClip
* --chimJunctionOverhangMin 10
* --chimScoreMin 1
* --chimScoreDropMax 30
* --chimScoreJunctionNonGTAG 0
* --chimScoreSeparation 1
* --alignSJstitchMismatchNmax 5 -1 5 5
* --chimSegmentReadGapMax 3
We found fusion gene by applying Arriba v1.2.0^[370]88 to the STAR
mapped BAM file. Mismatches were discovered by comparing chimeric reads
with reference genome assembly fasta and annotation gtf. Additionally,
fusion genes had bad quality of frequently found healthy tissues were
excluded from the analysis using the hg38 fusion blacklist provided by
Arriba. Arriba outputs were labeled with sample name and tissue and
concatenated with one table.
From the Arriba output, fusion genes with confidence of “low” was
removed. At this time, the fusion genes appearing in normal tissues
were collected and used as a blacklist for tumor tissue fusion genes.
If the fusion genes found in tumor tissue were included in this
blacklist, they were excluded from the analysis. Furthermore, fusion
genes including known cancer-related genes (ALK, ROS1, RET, and
PTK2)^[371]89 were annotated as “known fusion” and fusion genes found
across several tumor samples as “recurrent fusion”. They were used for
downstream analysis. After filtering, we visualized the structure and
protein domain of fusion genes using the R script (”draw_fusion.R”)
provided by arriba.
Protein extraction and tryptic digestion
For in-depth proteomic experiments, fifty milligrams of cryopulverized
human NSCLC and NAT samples were homogenized in lysis buffer at a ratio
of about 300 μL lysis buffer for every tissue. The lysis buffer
consisted of 5% SDS, 50 mM TEAB (pH 7.55, Thermo Fisher Scientific,
USA, 90114), protease inhibitor cocktail (1:100; Thermo Fisher
Scientific, USA, 78430), 1 EA of PhosSTOP (Roche, Swiss, 4906845001),
and 20 μM PUGNAc (Sigma-Aldrich, USA, A7229). Tissue lysis was
performed with a couple of probe sonications using the Digital Sonifier
SFX 550(Branson, USA). Sonication time was set at 30 s with a cycle of
on-time 5 s and an interval of 3 s at 28%. Lysates were centrifuged at
14,500 g at 4 °C for 5 min, followed by measuring protein concentration
using the BCA assay (Thermo Fisher Scientific, USA, 23225), and kept at
-80 °C until used for analysis.
Protein lysates were reduced at 95 °C for 10 minutes with 20 mM
1,4-dithiothreitol (Roche, Swiss, 10708984001), followed by alkylation
with 40 mM iodoacetamide (Sigma-Aldrich, USA, I6125) in the dark at RT
for 30 minutes. 12% phosphoric acid (Sigma-Aldrich, USA, 695017) of
1/10 of the sample volume and 7 multiple volumes of S-trap binding
buffer (100 mM TEAB in 90% MeOH) were added sequentially for formation
of colloidal status. The solution was loaded on an S-Trap™ spin column
(Protifi, USA) and centrifuged to 4000 g for 30 s. The captured protein
was washed three times at 4000 g for 30 s with 400 μL of binding
buffer. For tryptic digestion, 125 μL of digestion buffer (50 mM TEAB)
containing Pierce Trypsin/Lys-C Mix (Thermo Fisher Scientific, USA,
[372]A41007) (w/w ratio of 1:25) was added and incubated at 37 °C for
6 h. The digested sample was sequentially centrifuged with 80 μL of
digestion buffer, 80 μL of 0.2% formic acid (Honeywell, USA, 94318) at
1000 g for 60 s, and 80 μL of 0.2% formic acid in 50% acetonitrile
(ACN) at 4000 g for 60 s. Eluted peptides were dried and stored at
−80 °C until the next process. Samples were desalted using a C18 spin
column (Harvard Apparatus, USA) and dried. A205 application of NanoDrop
One (Thermo Fisher Scientific, USA) was employed for peptide
concentration measurements. A baseline was established using 1 µL of
HPLC water, and dried samples were each diluted with HPLC water and
measured at 1 µL. For the common reference sample, 10 μg aliquots were
taken from the sample being analyzed, pooled together, and used for
each TMT batch.
Construction of the common reference pool
In-depth proteomic analyses for this study were organized as TMTpro 16
plex experiments. For comparative quantification between all samples
across experiments, a common reference (CR) sample was involved at the
134 N channel in each 16-plex. The CR is a protein mixture that
contains all samples analyzed in the TMT experiments. All subsequent
procedures, including digestion, were performed along with other
individual samples to minimize variation.
TMTpro 16-plex labeling
Peptides, 250 μg per sample (based on peptide level quantification with
NanoDrop One), were labeled with 16-plex TMT reagents (Thermo Fisher
Scientific, USA, [373]A44520) according to the manufacturer’s protocol.
For each peptide aliquot of an individual sample, 1 mg of labeling
reagent was used. Dissolve each sample in 50 μL of 100 mM TEAB (pH
8.5), add 20 μL of labeling reagent dissolved in anhydrous
acetonitrile, and incubate for 1 h with shaking. 1.8 μg of labeled
peptides from each channel were taken out and subjected to LC-MS/MS
analysis to confirm labeling efficiency before pooling. Label
efficiency criteria were set as having a minimum of 95% fully labeled
MS/MS spectra in each sample. To quench the reaction, 5 μL of 5%
hydroxylamine was added and incubated for 15 minutes. Samples were
pooled in each 16-plex experiment and sequentially desalted with
Sep-Pak C18 3 cc Vac Cartridge (Waters, USA) and dried.
Mid pH reverse-phase liquid chromatography fractionation
To minimize sample complexity, samples were fractionated by mid-pH
reversed phase (RP) separation using Shimadzu LC20A (Shimadzu, Japan)
with an analytical column (XBridge Peptide BEH C18 Column; 300 Å, 5 µm,
4.6 × 250 mm, Waters, USA) and a guard column (SecurityGuard cartridge
C18, 4 ×3.0 mm, Phenomenex, USA). We performed the peptide
fractionation at a flow rate of 0.5 mL/min. Mobile phases A and B were
10 mM TEAB and 10 mM TEAB in 90% ACN, and the sample was dissolved in
0.1% formic acid in 90 μL. The LC gradient of mobile phase B was 5% in
8 min, 40% in 65 min, 44% in 69 min, 60% from 74 min to 88 min, and 5%
in 90 min. We collected a total of 96 fractions from 8 min, pooling
them into 24 non-consecutive fractions every 0.91 min. We then
concatenated 95% of each fraction into 12 non-consecutive fractions for
post-translational modifications (PTMs) analysis.
Phosphopeptide enrichment
For the phosphopeptide enrichment, immobilized metal affinity
chromatography (IMAC) was used. 300 μL of Ni-NTA agarose bead slurry
(QIAGEN, Germany, 30410) was resuspended in the tube. The slurry was
spun down for 1 minute, and the supernatant was removed. The bead was
washed three times with 1 mL of HPLC water and then incubated with
1.2 mL of 100 mM EDTA (Sigma-Aldrich, USA, E7889) for 30 minutes by
end-over-end turning at RT. After three washes with 1 mL of HPLC water,
the beads were incubated with 1.2 mL of 10 mM FeCl[3] for 30 minutes
with end-over-end turning at room temperature. 12 fractions for PTM
analysis are reconstituted in 113 μL of 0.1% TFA in 50% ACN, and then
337 μL of 0.1% TFA in 95% ACN is added sequentially. Beads were
resuspended in 140 μL of 1:1:1 ACN:MeOH:0.01% acetic acid solution and
10 μL of beads were aliquoted into 12 tubes for each fraction. We
resuspended the beads in the sample solution to bind the
phosphopeptides, and then gently mixed them for 30 minutes at room
temperature. The supernatant was collected separately for the next
acetyl peptide enrichment step. Beads coupled with phosphopeptides were
dissolved in 180 μL of 0.1% TFA in 80% ACN and desalted using a C18
stage-tip to elute for LC-MS/MS.
Acetylpeptide enrichment
For the acetylpeptide enrichment, the PTMScan® Acetyl-Lysine Motif
[Ac-K] Kit (Cell Signaling Technology, USA, 13416) was used. IMAC
eluents were concatenated into four fractions and dried. Peptides were
dissolved in 1.4 mL of 1 x IAP buffer, in which 10 x IAP buffer (5.78 g
MOPS-NaOH, 1.461 NaCl, 0.8 g dibasic sodium phosphate, and 0.08 g
monobasic sodium phosphate) were adjusted to pH with acetic acid
( ~ 1.4 to 1.6 mL). Agarose beads bound to the acetyl-lysine motif
antibody were pre-washed a total of 4 times using 1 x IAP buffer under
ice and split into 2 tubes with equal volume. The peptides were
transferred to a tube and incubated at 4 °C for 3 hours with
end-over-end turning. Each tube was centrifuged (2,000 g, 30 sec,
4 °C), and the supernatant was separated, followed by washing the beads
twice with 1 mL of ice-cold PBS and three times with chilled HPLC
water. 100 μL of 0.15% TFA was added to beads coupled with acetyl
peptides and then incubated at RT for 10 minutes, gently mixed every
2-3 minutes, and then centrifuged to elute. We desalted the eluted
acetyl peptides using a C18 stage-tip (IMAC procedure) and dried them
after a total of two elution steps.
LC-MS/MS for proteomics analyses
For global proteomic analyses, the Ultimate 3000 RSLC nano system
(Thermo Fisher Scientific, USA) coupled with the Q Exactive HF-X hybrid
quadrupole-Orbitrap mass spectrometer (Thermo Fisher Scientific, USA)
was used. Trap column (Acclaim™ PepMap™ 100 C18 LC Column, C18, 100 um
x 2 cm, 5 μm, Thermo Fisher Scientific, USA) and analytical column
(EASY-Spray™ LC Columns, C18, 75 um x 50 cm, 2 μm, Thermo Fisher
Scientific, USA), which were heated to 50 °C to prevent
over-pressuring, were equipped with UHPLC separation. Mobile phase flow
rate was 0.3 μL/min, and solvents A and B were 97% water, 3% ACN, 0.1%
formic acid, and 90% ACN, 0.1% formic acid, respectively. The gradient
profile of solvent B was 2% in 12 min, 8% in 16 min, 25% in 140 min,
35% in 150 min, 85% from 155 min to 163 min, and 2% from 165 min to
185 min. Data-dependent acquisition was performed at a spray voltage of
1.5 kV. MS1 spectra were measured with a resolution of 120,000 and a
mass range of 350 to 1500 m/z. The AGC target of 3e6, maximum injection
time of 50 ms, and isolation window of 0.7 m/z were set. Top 15 most
abundant precursors per cycle were selected to trigger MS/MS. MS2
spectra were measured with a resolution of 45,000, 1e5 of the AGC
target, 120 ms of maximum injection time, 34% of collision energy, and
110 of the fixed first mass (m/z). The precursor charge state was 2-5,
the intensity threshold was 1e4 and the dynamic exclusion time was
45 s.
For PTM analysis, the Ultimate 3000 RSLC nano system (Thermo Fisher
Scientific, USA) coupled with the Exploris 480 orbitrap mass
spectrometer (Thermo Fisher Scientific, USA) was used. Trap column
(Acclaim™ PepMap™ 100 C18 LC Column, C18, 100 um x 2 cm, 5 μm, Thermo
Fisher Scientific, USA) and analytical column (EASY-Spray™ LC Columns,
C18, 75 um x 50 cm, 2 μm, Thermo Fisher Scientific, USA) were equipped
with UHPLC separation. The mobile phase flow rate was 0.3 μL/min, and
solvents A and B were 0.1 % formic acid in water and 0.1 % formic acid
in ACN, respectively. The LC gradient of solvent B was 2% in 14 min, 4%
in 17 min, 16% in 120 min, 25% in 145 min, 85% from 150 min to 158 min,
and 4% from 160 min to 185 min. Data-dependent acquisition was
performed at a spray voltage of 2.1 kV and the cycle time was set to
3 sec. MS1 spectra were measured with a resolution of 120,000 and a
mass range of 350 to 1500 m/z. The normalized AGC target (%) was 300,
and the maximum injection time was 50 ms. MS2 spectra were measured
with a resolution of 45,000, 7.5 e4 of AGC target, 120 ms of maximum
injection time, 38% of collision energy, and 0.7 m/z of isolation
window. Peptides were selected for a 2-6 precursor charge state with an
intensity threshold of 1e4. Peptides that triggered MS/MS scans were
dynamically excluded from further MS/MS scans for 45 sec, with a +/- 10
ppm mass tolerance.
The database search strategy
In addition to canonical peptides, to reliably identify variant,
post-translational modified, and novel peptides in global proteomics,
we used a multi-stage search strategy: 1) the identification of
canonical peptides (primary database search), 2) the identification of
variant and modified peptides by considering 2355 modifications in
Unimod, and 3) the identification of novel peptides. For each stage,
the FDR was calculated separately using a target-decoy strategy, and
identifications were obtained at 1% FDR. Only unidentified spectra from
the previous stage were subjected to the subsequent stage. Details of
each stage are described in the following sections: 1) “The Primary
Database Search”, 2) “Identification of Variant and Modified Peptides”,
and 3) “Identification of Novel Peptides”.
The primary database search
All MS raw files were analyzed using Proteome Discoverer v2.4 with the
SequestHT search engine^[374]90 against the UniProt human protein
database v2021.01 (97,795 entries), combined with 179 common
contaminant proteins. The search parameters were set to 10 ppm for
precursor mass tolerance, 0.02 Da for fragment mass tolerance,
fully-tryptic enzyme specificity allowing for up to 2 missed cleavages,
carbamidomethylation ( + 57.021 Da) on Cys residues and TMT
modification ( + 304.207 Da) on the peptide N terminus and Lys residues
for static modifications, and oxidation ( + 15.995 Da) on Met residues,
Met-loss (-131.040 Da) on protein N-terminal Met residues and
deamidation ( + 0.984 Da) on Asn and Gln residues for dynamic
modifications. The minimum length of a peptide was set at six residues.
For phosphoproteome and acetylproteome, the search parameters
additionally included phosphorylation ( + 79.966 Da) on Ser, Thr, and
Tyr residues and acetylation ( + 42.016 Da) on the protein N terminus
and Lys residues for dynamic modifications. The TMT modifications were
set as dynamic modifications for acetylproteome, to prevent
co-assignment of TMT and acetylation. All peptide spectrum matches
(PSMs) were subsequently rescored by Percolator^[375]91 and validated
at an estimated false discovery rate (FDR) of 1%. For phosphosite and
acetylsite localization, ptmRS^[376]92 was used, and modified sites
with ptmRS probability greater than 0.95 were regarded as confident.
Identification of variant and modified peptides
Using CustomProDB^[377]93, each patient’s germline and somatic
mutations were used to generate variant protein sequences. The variant
protein sequences were merged for each batch and combined with 16,130
UniProt human proteins identified from the primary database search and
179 common contaminant proteins to generate combined customized
databases.
All MS raw files were converted to MGF files using msconvert v3.0.1,
and the precursor m/z values were replaced with the recalibrated values
by Proteome Discoverer in the primary database search (this
recalibrated MS2 spectra were used in all subsequent searches). The
unidentified MS2 spectra from the primary database search were searched
against the combined customized databases using MODplus v1.02^[378]94.
The search parameters were set to 10 ppm for precursor mass tolerance,
0.01 Da for-fragment mass tolerance, semi-tryptic enzyme specificity
allowing for up to 2 missed cleavages, -1 to +2 for 13 C isotope error,
carbamidomethylation ( + 57.021 Da) on Cys residues and TMT
modification ( + 304.207 Da) on the peptide N terminus and Lys residues
for static modifications. All modifications in Unimod database v2020.10
(2355 entries) including all amino acid substitutions were considered
for dynamic modification (multiply modified peptides were allowed
within the modified mass range of -150 to +350 Da). All PSMs were
validated at an estimated FDR of 1% using MODplus FDR toolkit.
To identify neoantigen candidates, the proteomics-supported somatic
mutations were filtered by examining the intensity of TMT reporter ion
matched to the patient for whom the somatic mutation was called (the
corresponding reporter ion intensity must account for at least 20% of
the total intensity of all reporter ions). The immunogenicity of
filtered somatic mutations was also evaluated by predicting binding
affinity to human leukocyte antigen (HLA)-I molecules.
Identification of novel peptides
All unidentified MS2 spectra from the primary and subsequent variant
database searches were analyzed to identify novel peptides originating
from pseudogenes, long noncoding RNAs (lncRNAs), untranslated regions
(UTRs), and novel isoforms (including fusion genes). For pseudogenes,
lncRNAs, and UTRs whose RNA transcripts’ FPKMs are greater than 0, the
RNA sequences were selected from the GENCODE transcript fasta files
v32, and their three frame translations were generated for each
patient, except for UTRs, whose translations were generated by the UTR
sequence database construction method^[379]55, which assumes that
alternative cognate and near-cognate start codons and translational
readthrough in the stop codons can result in abnormal translation.
These noncoding peptide sequences were merged for each batch and
matched to MS2 spectra by MS-GF+ v2021.09^[380]95 with the following
search parameters: 10 ppm for precursor mass tolerance, tryptic enzyme
specificity allowing up to 2 missed cleavages (also permitting
non-enzymatic terminals), no 13 C isotope error allowed,
carbamidomethylation ( + 57.021 Da) on Cys residues and TMT
modification ( + 304.207 Da) on the peptide N terminus and Lys residues
for static modifications, oxidation ( + 15.995 Da) on Met residues and
deamidation ( + 0.984 Da) on Asn and Gln residues for dynamic
modifications, and TMT protocol. The minimum length of a peptide was
set at 8 residues. All PSMs were validated at an estimated FDR of 1%.
The novel transcripts, including fusion genes, were predicted by
StringTie v2.1.7 (class code =, c, k, m, n, and j)^[381]78,[382]96 and
Arriba v1.2.0^[383]88, respectively, and their three frame translations
were generated to compose the novel protein isoform sequences for each
patient. The novel isoform sequences were merged for each batch and
matched to MS2 spectra using Comet v2021.02.0^[384]97 with the
following parameters: 10 ppm for precursor mass tolerance, semi-tryptic
enzyme specificity allowing up to two missed cleavages, no isotope
error allowed, carbamidomethylation ( + 57.021 Da) on Cys residues and
TMT modification ( + 304.207 Da) on the peptide N terminus and Lys
residues for static modifications, and oxidation ( + 15.995 Da) on Met
residues and deamidation ( + 0.984 Da) on Asn and Gln residues for
dynamic modifications. The minimum length of a peptide was set at 8
residues. All PSMs were subsequently rescored by Percolator and
validated at an estimated FDR of 1%.
We rejected PSMs conflicting with both MS-GF+ and Comet (i.e.,
identical spectra but different peptides assigned). The remaining novel
peptides were searched using BLAST^[385]98 and filtered out if there
were peptide sequence matches in the Uniprot (v2022.03, 226,999
entries), RefSeq (v2022.09, 130,184 entries) and/or Gencode (v41,
110,224 entries) human protein sequences, allowing no more than a
single amino acid substitution. We also removed the frameshift peptides
to concentrate solely on novel peptides derived from noncoding regions
and novel isoforms. Finally, we used peptides supported by at least one
RNA-Seq read and a quantifiable PSM to retrieve unique peptides at the
gene level. In total, we obtained and used 1045 novel sequences for
further analysis.
Quantification of Protein, Phosphosite and Acetylsite
Peptide abundances were normalized to have the same total amount of
peptide abundances between TMT channel values in the same batch by
Proteome Discoverer v2.4 with the SequestHT search engine^[386]90
against the UniProt human protein database v2021.01 (97,795 entries),
combined with 179 common contaminant proteins. The co-isolation
threshold 50% and the average reporter S/N threshold 10% were used to
filter out quantification values of low quality. For global proteome
quantification, proteins with at least one unique and/or razor peptides
were exported, and normalization between batches was achieved by
dividing each TMT channel value by that of the common reference (CR)
channel in the same batch. The log[2] ratio value was used for the
subsequent quantification analysis.
The quantification of phosphosites and acetylsites was performed using
the same method: the sum of peptide abundances containing each
modification site was calculated to represent modification site
abundance, and each TMT channel value was normalized using the same
method as the global proteome data. After applying two-component
normalization of TMT ratios for each proteomics dataset^[387]99, the
log[2] ratio value was used for the subsequent quantification analysis.
We selected features that have less than 30% missing values across all
tumors for the global proteomics, phosphoproteomics, and
acetylproteomics dataset using preprocessed normalized log[2] ratio
values. After then, we performed k-nearest neighbor imputation (k = 5)
for missing values using the impute R package (10.18129/B9.bioc.impute)
for the subsequent quantification analysis.
NMF clustering analysis for multiomics subtypes
To identify patient subtypes, we performed non-negative matrix
factorization (NMF) analysis on global proteins and PTM sites as
instructed in previous CPTAC studies^[388]10,[389]13. We concatenated
preprocessed imputated global proteomics, phosphoproteomics, and
acetylproteomics dataset into a single matrix. Then, we excluded
features with the lowest standard deviation (bottom 5th percentile),
followed by scale and z-score transformation. Using the NMF
R-package^[390]100, we performed the NMF analysis for factorization
ranks from 2 to 10 using the standard NMF algorithm from Brunet et
al.^[391]101 and 2000 max iterations for convergence and repeated 50
times to compute clustering statistics. We determined the optimal
factorization rank k, representing the number of clusters, from the
rank with the maximal cophenetic correlation coefficient value and its
drastic decrease. With the optimal factorization rank k, we sought to
robust clusters from the NMF analysis using 200 runs and 5000 max
iterations. From decomposed NMF matrices, we assigned samples to NMF
clusters corresponding to the optimal factorization rank k and obtained
features specific to each cluster according to the row-wise feature
score^[392]102. We defined a “core sample” for those having a cluster
membership score ≥ 0.5 as described previously^[393]10,[394]13.
Combined NMF and Comparison of NMF subtypes with previous NSCLC subtypes
To assess the concurrence between our subtypes and lung cancer subtypes
defined through recent multi-omics studies, we conducted NMF analysis
by integrating our study cohort with previous CPTAC study cohort, which
contains LUAD^[395]10 and LSCC^[396]13. Prior to running NMF, each
study was subjected to pre-processing using the methodology outlined in
the “NMF clustering analysis for multiomics subtypes” section. Features
pertaining to proteomics, phosphoproteomics, and acetylproteomics
datasets were examined, and only those identified as commonly present
in all three studies were selected and concatenated into a single
matrix for NMF clustering. Using the NMF R-package, we sought to robust
cluster from the NMF analysis using 200 runs and 5000 max iterations
with rank 5. We analyzed the subtypes derived from NMF decomposition
matrices for each individual study within the study cohort samples of
LUAD and LSCC and compared them with the subtypes identified in the
combined NMF results.
To validate the robustness of our subtype classification in a recent
NSCLC study^[397]10,[398]11,[399]13,[400]14 comprising 462 patients, we
conducted a comparative analysis of subtype features across multiomics
datasets included in each individual study. To identify the features
associated with the subtypes of each study, we performed re-clustering
of NMF features using the omics datasets and parameters employed in
prior studies and selected features utilizing the “Max”
method^[401]102. Gillette study was conducted with RNA, global-,
phospho- and acetyl-proteomics datasets for NMF, and used 4 ranks, 200
runs and 5000 max iterations. Satpathy study was conducted with CNV,
RNA, global-, phospho-, and acetyl-proteomics dataset for NMF and used
5 ranks, 500 runs, and 5000 max iterations. Also, Lehtio study was
conducted with only global proteomics dataset for NMF, and used 6
ranks, 100 runs, and 5000 max iterations. For Xu’s (2020) study, the
up-regulated protein list for each subtype as presented was utilized as
a subtype-specific features. To confirm a statistically significant
relationship, an odds ratio and p value were calculated by conducting a
Fisher’s exact test in R between each study cluster’s features.
Gene set enrichment analysis
To obtain characteristic analysis information corresponding to NMF
subtype based on the expression data of multi-omics of tumor samples,
we performed PTM signature Enrichment Analysis (PTM-SEA^[402]103) for
phosphoprotoemics dataset and R package GSVA^[403]104 for global-,
phospho- and acetyl protoemic datasets.
For GSVA, pathways from hallmark, KEGG, Reactome, GO and Wikipathway
databases which were downloaded from MSigDB were considered and we used
only pathways which contains more than 200 genes and lower than 1000
genes in each pathway.
database: “h.all.v2023.1.Hs.symbols.gmt.txt”,
“c2.cp.kegg.v2023.1.Hs.symbols.gmt.txt”,
“c2.cp.reactome.v2023.1.Hs.symbols.gmt.txt”,
“c2.cp.wikipathways.v2023.1.Hs.symbols.gmt.txt”,
“c5.go.v2023.1.Hs.symbols.gmt.txt”
For PTM-SEA, flank amino acid sequence (+/- 7 aa) was added to PTM data
as primary identifier and used. We computed normalized enrichment
scores (NES) of gene sets. We used the implementation which contains
PTM-SEA available on GitHub
([404]https://github.com/broadinstitue/ssGSEA2.0) using the command
interface R-script (ssgsea-cli.R) using the following parameters:
* database:”ptm.sig.db.all.flanking.human.v1.9.0.gmt”
* sample. norm. Type: “rank”
* output. score. Type = “NES”
* nperm = 100
For identifying NMF subtype feature pathway, we select Hallmark pathway
gene sets (MSigDB), which summarize specific cancer-related biological
states or processes for ssGSEA and PTM signatures database (PTMsigDB)
containing a list of modification site-specific signatures.
Identification of differentially enriched copy number variations
Gene-level copy numbers detected by CNVkit were converted to
cytoband-level by calculating the mean copy numbers in each cytoband.
Then we used a linear regression model to identify differentially
enriched copy number variations in each NMF Subtype. To control the
effect of the pathological diagnosis classification (DX), we set DX as
a covariate.
[MATH:
Cytoban<
mi>d−levelcopynumbe
mi>rs=β0+β1
⋅Subtyp<
/mi>e+β<
mrow>2⋅DX+ε :MATH]
β[0] and β[1] indicate the intercept and the log2FC value,
respectively. Cytobands with an adjusted p-value (calculated using
Benjamini–Hochberg method) less than 0.05 were selected as significant
CNVs.
Identification of differentially expressed genes (DEGs)
To identify differentially expressed genes between conditions, we used
“DESeq” function provided by DESeq2 R packages^[405]76. Quantified
transcript-level estimates from Salmon were imported to R using
“tximport” and were converted to a gene-level expression matrix. With
this gene-level expression matrix, DESeqDataSet was generated using
“DESeqDataSetFromTximport” function. Then, differential expression
analysis based on the negative binomial distribution was performed
using “DESeq” function. Notably, we conducted the DEG analysis using
the formula below.
[MATH:
design=~Grou
mi>p+DX+rna.b
mi>atch :MATH]
In this model, Group represent the conditions under comparison. DX is
the name of the pathological diagnosis classification, including LUAD
or LSCC. Batch information for RNA (rna.batch) was also used as a
covariate in DEG analysis to remove batch effects from DEG analysis.
The adjusted p-value was calculated using the Benjamini-Hochberg
method, and an adjusted p-value less than 0.05 was used as a criterion
for differentially expressed genes.
Identification of differentially expressed proteins, phosphorylation, and
acetylation (DEP)
We identified differentially expressed proteins, phosphorylation, and
acetylation using linear regression models as below.
[MATH: Y=β0
mrow>+β1⋅(Gr
oup)+β2⋅DX+β2⋅Batch+ε :MATH]
y[i] is a vector of dependent variables that represents the expression
values of each proteomic feature. Group is a vector of independent
variables that represents the state we are trying to compare. DX is the
name of the pathological diagnosis classification, and Batch is a TMT
channel number that represents batch information. β[0] is intercept,
β[1] is beta coefficient for Group variables, and represents the
log[2]FC value. The linear regression’s calculated p-value tests the
null hypothesis that the coefficient is equal to zero, indicating no
difference between groups. The adjusted p-value was calculated using
Benjamini–Hochberg method, and an adjusted p-value less than 0.05 was
used as a criterion for differentially expressed proteins,
phosphorylation, and acetylation.
Survival analysis
To compare the survival probability across our five NMF subtypes, we
measured overall survival (OS) length (Supplementary Data [406]1a), the
time from the date of tumor resection surgery to the time of
NSCLC-induced death, for 229 patients. With NSCLC-death, we performed a
survival analysis based on Kaplan–Meier estimation. In the Kaplan–Meier
estimation model, NSCLC-death was used as an indicator of the ending
time of the OS length. It is labeled as a right-censored observation in
the case of patients with no deaths, which causes NSCLC at the end of
the study. For patients who have died but whose cause of death is not
NSCLC, we can use partial information that they survived beyond a
certain point, but the exact date of NSCLC-death is uncertain.
The Kaplan-Meier estimation was used to measure survival curves for
each subtype. The Kaplan-Meier survival curve is defined as the
probability of surviving for each length of time after tumor resection
surgery while considering time at many small intervals. For each time
interval, survival probability is calculated as the number of patients
who survived divided by the number of patients at risk. Patients who
have died, dropped out, or moved out are not counted as “patients at
risk”. The total probability of survival until that time interval is a
cumulative probability, calculated based on the law of multiplication
of probability by multiplying all the probabilities of survival at all
time intervals until a certain point^[407]105. We used the log-rank
test to statistically test the null hypothesis that there is no
difference between the survival curves of each subtype. For the
log-rank test, the total number of observed events in each subtype,
i.e., O[1] and O[2] was used, and the expected number of events in each
group, i.e., E[1] and E[2], was calculated. The total number of
expected events is calculated as the sum of the expected number of
events at the time of each event in any of the subtype, bringing all
subtypes together. The expected number of events at the time of each
event is the result of multiplying the total number of patients
surviving at the time of events in all subtypes by the risk of events
at time^[408]105. Log-rank test statistic is calculated as below.
[MATH:
Log−ran<
mi>kteststati
mi>stic=O1−E12E1
+O2−E22E2
:MATH]
Since there are five subtypes in total, the pooled p value is
calculated and shown in Fig. [409]2g. We perform pairwise comparison
between Subtypes to decide which Subtypes represent poor prognosis
subtype of NSCLC and adjusted p-value was calculated using the
Benjamini–Hochberg procedure.
To identify the molecular features that result in survival probability
differences, we also performed feature-wise survival analyses comparing
survival curves between group of patients with top 50% expression and
bottom 50% expression for each protein and PTM features based on
Kaplan-Meier estimation. The prognostic direction, which shows whether
the prognosis is favorable or unfavorable, was determined by the sign
of coefficient from Cox proportional-hazard (CPH) model not from
Kaplan-Meier model. Since in many cases, the survival probability does
not reach 0.5, it was difficult to compare the median survival time
that can be used as an indicator of prognostic direction in the
Kaplan–Meier model. CPH model was used for univariate analysis and
categorical analysis. h(t) is hazard function determined by the
univariate, high- or low expression group. h[0] is called the baseline
hazard. It corresponds to the value of the hazard if x[1] is equal to
zero. The coefficient b[1] measure the impact of univariate and tells
us prognostic direction by sign. x[1] is categorical variable, low- or
high expression group. When sign of the coefficient b[1] for high
expression group is positive, its feature classified as unfavorable
prognostic features and vice versa.
[MATH: ht=h
mi>0t×exp<
mo>(b1x1) :MATH]
For significance, a log-rank test was used, and the nominal
p-value < 0.001 was used as a criterion for survival-associated
features. The results of the feature-wise survival analysis are listed
in Supplementary Data [410]2d. We selected the top 1% of features by
statistical significance for survival-associated features.
Kinase activity estimation based on phosphoproteomic data
To understand the Subtype-specific kinase activities, we used decoupleR
R package to infer kinase activity from the results of differentially
expressed phosphorylation (DEPP) using prior knowledge network, such as
databases for kinase/phosphatase and substrate interactions^[411]106.
We used prior knowledge network provided by PHONEMeS^[412]107. Also, we
set minimum size of regulons to 5, and provided 1000 deregulated
phosphorylation (each 500 phosphorylation for up-, down-regulated).
There are a variety of methods to infer kinase activities and we used
multivariate linear model (MLM)-based method as below:
[MATH: Y=βX+ψ :MATH]
where the dependent variable Y represents the t-statistic measurements
of phosphorylation from the results of the differentially expressed
phosphorylation analysis, and the independent variable X is the
connectivity matrix representing the associations with kinases. X[ij]
equals 1 when phosphosite i is a known substrate of the kinase j, 0
otherwise. Ψ represents the normally distributed error of the fit and β
represents the scores of the kinase activity^[413]108. In decoupleR
packages in R, we used “run_mlm” function to estimate kinase activity.
P-value was obtained from a multivariate linear model and the adjusted
p-value was calculated using the Benjamini–Hochberg adjustment.
Single-cell specific subtype distribution
To assess the tumor microenvironment in five NSCLC subtypes by
comparing subtype-specific genes with a diverse set of cell
type-specific genes, using the dataset ‘local_extended.rds’ obtained
from [414]https://luca.icbi.at. Subtype-specific gene sets were
generated based on the results of differential gene expression analysis
(DEGs), including only genes with significantly increased fold change
and a significant adjusted p-value (<0.05).
For the tumor versus tumor comparison, we considered only the cell
type-specific genes originating from ‘tumor_primary’ cells. Conversely,
for the tumor versus normal adjacent tissue (NAT) comparison, we used a
subset of cell type-specific genes from both ‘tumor_primary’ and
‘normal_adjacent’ cells.
Clustering analysis for immune microenvironment
We used transcriptome data of NSCLC patients to define the subtypes of
tumor immune microenvironment. Two clustering approaches were used,
which were based on the enrichment score of cell types^[415]41 and
immune-related pathways^[416]42. The cell type-based clustering was
performed with 205 tumors and 85 normal samples. The enrichment scores
of 64 cell types were inferred by Xcell^[417]41, which performed gene
set enrichment analysis based on the curated gene signatures for each
cell type. The inferred cell enrichment scores were then normalized as
z-score and consensus clustering was performed using R package
CancerSubtypes^[418]109. After the consensus clustering, the
partitioning around medoids (PAM) algorithm was used to divide the
three clusters. We first assigned a NAT-enriched cluster among the
three clusters when it was mainly matched to normal samples. After
that, the immune score calculated by Xcell was used to determine the
HTE and CTE immune clusters, so that the cluster with the higher immune
score was defined as having an HTE tumor. We also performed
pathway-based clustering with 205 tumor samples. For that, GSVA was
performed across the patients based on seven curated immune-related
pathways. The enrichment scores of the pathways were normalized as
z-scores, and k-means clustering was performed based on the z-scores to
identify two immune clusters. The cluster having a higher pathway
enrichment score was defined as HTE tumor.
Survival and regression analysis with immune cluster and cell types
The enrichment score of cell type and the status of the immune cluster
across the patients were tested for their correlation with a set of
clinical and molecular features. The pattern of TILs was examined by
regression analysis using MASS package and GLM package of R. Overall
survival and relapse-free survival were tested for their correlations
with cell types and immune clusters by Cox proportional hazards
regression analysis using CoxPHFitter package of Python. All analyses
were performed with clinical histology and sample batches as
covariates.
Identifying putative regulators associated with immune landscape
To identify putative regulators associated with immune landscape, we
first inferred protein activity using transcriptome data of the
patients. For the protein activity, we obtained regulon networks of
lung cancer from ARACNe package^[419]110 of R and calculated protein
activity of each patient using viper tool^[420]44 that inferred how
each protein regulates its target genes based the regulon network. The
positive value of protein activity indicates that the protein
positively regulates its target genes to be overexpressed, and the
negative value indicates vice versa. We then calculated the correlation
between the protein activity of immunomodulators and the enrichment
score of cells, or the status of the immune cluster. The RNA and
protein expression of the same immunomodulators was also analyzed for
their associations with the immune landscapes. To corroborate the
putative regulation of the immune landscape, we further analyzed
whether a specific driver mutation was involved in the association
between immunomodulators and immune cells or immune clusters. The
curated driver mutations were derived from Oncovar^[421]111,
DriverDBv3^[422]112, Intogen^[423]113, and mutation catalogue from
Martínez-Jiménez et al.^[424]114.
Immune landscape analysis with the integrated cohort
To test the reproducibility of our analysis about the immune landscape
across NSCLC patients, we used two independent NSCLC cohorts of
Satpathy et al.^[425]13 and Gillette et al.^[426]10. The z-score
normalization was performed for the enrichment scores of 64 cell types
in each cohort that had 202 (Satpathy et al.), 211 (Gillette et al.),
and 290 (our cohort) patients. The normalized enrichment scores of cell
types were then integrated into a single cohort and performed consensus
clustering. PAM algorithm was used to make three clusters and they were
mapped to NAT, HTE, and CTE, respectively, according to the method used
in our original cohort.
Neoantigen and cryptic peptide prediction
We predicted different types of neoantigens and cryptic peptides
according to their origin and validation (Supplementary Fig. [427]6a).
Firstly, canonical neoantigens were predicted from the mutated peptides
that had a length between 9mer and 12mer amino acids including somatic
mutations. The binding affinity of the mutated peptides was predicted
using MHCflurry^[428]115 and NetMHCpan^[429]116 for the patient HLAs
that were identified by OptiType^[430]117. Only the mutated peptides to
be predicted as binding to the patient HLAs were defined as the
potential neoantigen. Among the neoantigen candidates, we further
defined “confirmed neoantigens” when they had evidence of MS
experiments (see the section “Identification of variant and modified
peptides”). The remaining two types were defined by non-canonical
peptides that originated from unconventional translation of
pseudogenes, lncRNAs, UTRs, or novel isoform transcripts (Refer to
section “Identification of novel peptides”). We identified novel
isoform transcripts that were expressed only in tumor samples when they
matched normal samples. The mean expression of normal samples was used
when the tumor sample did not match normal. We treated a degree of FPKM
below 1 as indicating no expression. Only the novel peptides that
occurred in more than two tumor samples and not in at least one normal
sample were predicted as “cryptic peptides” using the same methodology
for neoantigen candidates in the prediction of binding to HLAs. We
further determined “confirmed cryptic peptide” when the cryptic
peptides also showed substantial expression (FPKM > = 1) in the tumor
samples as described in the previous study^[431]118.
Analysis of multi-omics subtype distribution according to the cryptic peptide
load and immune landscape status
We analyzed the distribution of the patients of each multi-omics
subtype across the status of cryptic peptide load and immune landscape.
Two features were chosen as the surrogates of the immune landscape,
which were the status of the immune cluster (HTE/CTE) and the antigen
processing and presentation machinery (APM). The patients were first
separated into two groups that had a high and low number of cryptic
peptides. They were also divided into two other groups: HTE and CTE, or
high APM and low APM. The status of APM was determined based on the
pathway enrichment score of “Antigen_Processing_and_Presentation” in
KEGG pathway^[432]119. We created four categories based on the status
of cryptic peptide load and immune landscape, and the degree of
over-representation for each multi-omics subtype was calculated by a
two-sided Fisher’s exact test.
Patient-derived lung tumoroid culture
Patient-derived lung tumoroid culture and drug screening were performed
by SG Medical, Inc. (Seoul, Korea) and conducted as previously
described^[433]120. Briefly, tissues were kept in a cold Hank’s
balanced salt solution (HBSS) with antibiotics (Gibco, OK, USA) and on
ice after dissection. Samples were washed three times with cold HBSS
with antibiotics and sectioned with sterile blades. Sectioned samples
were incubated with 0.001% DNase (Sigma-Aldrich, MO, USA), 1 mg/ml
collagenase (Roche, IN, USA), 200 U/ml penicillin, and 200 mg/ml
streptomycin in DMEM/F12 medium (Gibco, OK, USA) at 37 °C for 1 h with
intermittent agitation. After incubation, the suspensions were
repeatedly triturated by pipetting and passed through 40-μm cell
strainers (BD Falcon, CA, USA). The strained cells were centrifuged at
112 × g for 3 min, and the pellet was resuspended in lung tumoroid
culture media (DMEM/F12 supplemented with 20 ng/ml of bFGF (Invitrogen,
CA, USA), 50 ng/ml human EGF (Invitrogen), N2 (Invitrogen), B27
(Invitrogen), 10 μM ROCK inhibitor (Enzo Life Sciences, NY, USA), and
1% penicillin/streptomycin (Gibco, OK, USA). The suspension was mixed
with matrigel and seeded onto 6-well plates. Culture media were
replaced every 4 days.
Drug screening
Lung tumoroids cultured were harvested and dissociated using TrypLE
Express (Gibco, OK, USA). The dissociated lung tumoroids were diluted
in a lung tumoroid culture media-matrigel mixture and seeded onto
384-well plates (250 cells per well). After lung tumoroid generation, 8
concentrations of Selinexor (Selleckchem, TX, USA) and vehicles (DMSO)
as a negative control were added in triplicate. After 6 days,
quantification of cell viability was done by adding 10 μl of
CellTiter-Glo 3D (Promega) to each well according to the manufacturer’s
instructions on a Varioskan LUX Multimode Microplate Reader (Thermo
Fisher Scientific, MA, USA). The determination of IC50 values was
conducted using GraphPad Prism.
CPTAC NSCLC data download and preprocessing
Clinical data, MAF file, and gene-level CNV data are downloaded from
the supplementary tables provided by previous studies^[434]10,[435]13.
Segment-level CNV data is downloaded from the GDC data portal
([436]https://portal.gdc.cancer.gov), and global proteome,
phosphoproteome, acetylproteome, and clinical data, including survival
information, were downloaded from LinkedOmics
([437]http://www.linkedomics.org). For global, phospho-, and
acetyl-proteomics data, we performed the imputation in the same way as
in our cohort. Briefly, features with more than 30% missing values
across all tumors were discarded, and k-NN imputation using k = 5 was
performed. We conducted all downstream analyses using these data in the
same manner as we did with our cohort.
Reporting summary
Further information on research design is available in the [438]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[439]Supplementary Information^ (6.2MB, pdf)
[440]Peer Review File^ (4.8MB, pdf)
[441]Reporting Summary^ (1.1MB, pdf)
[442]41467_2024_54434_MOESM4_ESM.pdf^ (181.7KB, pdf)
Description of Additional Supplementary Files
[443]Supplementary Data 1^ (317.3KB, xlsx)
[444]Supplementary Data 2^ (1.6MB, xlsx)
[445]Supplementary Data 3^ (12.8MB, xlsx)
[446]Supplementary Data 4^ (231.9KB, xlsx)
[447]Supplementary Data 5^ (308.8KB, xlsx)
[448]Supplementary Data 6^ (480.6KB, xlsx)
[449]Supplementary Data 7^ (120MB, xlsx)
Acknowledgements