Abstract
Identification of risk biomarkers may enhance early detection of
smoking-related lung cancer. We measured between 392 and 1,162 proteins
in blood samples drawn at most three years before diagnosis in 731
smoking-matched case-control sets nested within six prospective cohorts
from the US, Europe, Singapore, and Australia. We identify 36 proteins
with independently reproducible associations with risk of imminent lung
cancer diagnosis (all p < 4 × 10^−5). These include a few markers (e.g.
CA-125/MUC-16 and CEACAM5/CEA) that have previously been reported in
studies using pre-diagnostic blood samples for lung cancer. The 36
proteins include several growth factors (e.g. HGF, IGFBP-1, IGFBP-2),
tumor necrosis factor-receptors (e.g. TNFRSF6B, TNFRSF13B), and
chemokines and cytokines (e.g. CXL17, GDF-15, SCF). The odds ratios per
standard deviation range from 1.31 for IGFBP-1 (95% CI: 1.17–1.47) to
2.43 for CEACAM5 (95% CI: 2.04–2.89). We map the 36 proteins to the
hallmarks of cancer and find that activation of invasion and
metastasis, proliferative signaling, tumor-promoting inflammation, and
angiogenesis are most frequently implicated.
Subject terms: Predictive markers, Non-small-cell lung cancer, Tumour
biomarkers
__________________________________________________________________
Lung cancer screening could enhance early diagnosis and treatment.
Here, the authors used proteomic analysis of pre-diagnosis samples
across 6 cohorts to identify 36 proteins associated with imminent lung
cancer diagnosis.
Introduction
Lung cancer is the leading cause of cancer death globally^[22]1. The
5-year survival is 20%, but varies from 60% for early-stage disease
(stage 1–2) to 6% for late-stage disease (stage 4)^[23]2. In the United
States (US), lung cancer mortality declined by 6% annually from 2013 to
2016^[24]3. This improvement can be attributed to advancements in
diagnosis and treatment for patients with both early- and late-stage
lung cancer^[25]4. Improved surgical techniques, stereotactic
body radiotherapy (SBRT), and adjuvant chemotherapy have improved
prognosis for early-stage patients, whereas patients with locally
advanced disease have benefitted from the introduction of
radio-chemotherapy, adjuvant immunotherapy, and neoadjuvant immune
checkpoint inhibitors (ICIs). However, most lung cancer patients are
diagnosed with late-stage disease where curative treatment is rarely
possible, even though developments in targeted and immunotherapy
combinations have improved short-term survival^[26]4.
Despite advances in lung cancer treatment, improving early detection is
the most promising strategy to improve long-term survival. Screening
with low-dose computed tomography (LDCT) has the potential to
substantially increase the proportion of lung cancer patients diagnosed
with early-stage disease who can be offered treatment with curative
intent. The ability of LDCT screening to decrease lung cancer mortality
among high-risk people with a history of smoking has been demonstrated
in several randomized trials^[27]5,[28]6, but some concerns remain,
including how to best identify and reach those individuals who are
likely to benefit from screening, and how to manage indeterminate
pulmonary nodules detected on LDCT.
The advent of LDCT screening and the introduction of targeted therapies
have highlighted a need to identify lung cancer biomarkers that can be
used to (i) identify high-risk individuals who may benefit from
screening, (ii) inform diagnostic work-up and nodule management after
LDCT screening, and (iii) choose optimal treatment regimens and monitor
response to treatment. In 2018, the US National Cancer Institute funded
the Integrative Analysis of Lung Cancer Etiology and Risk (INTEGRAL)
program, an ambitious initiative focusing on developing biomarkers that
can refine eligibility criteria for LDCT screening and diagnostic
work-up following LDCT^[29]7. Here, we present results from the initial
large-scale analysis designed to identify circulating protein
biomarkers associated with imminent lung cancer diagnosis in the
general population of individuals with a smoking history. Using a
high-throughput proteomics approach, we screened over 1000 circulating
proteins in blood samples drawn up to three years prior to diagnosis
within the Lung Cancer Cohort Consortium (LC3).
Here we focus on identifying proteins robustly associated with risk of
imminent lung cancer diagnosis, and then on describing their
epidemiological properties, the biological pathways to which they
belong, and their known relevance in carcinogenesis.
Results
Our study was designed to identify protein markers of imminent lung
cancer in people with a smoking history from the general population. We
defined imminent lung cancer as a clinical lung cancer diagnosis within
three years of blood draw and identified 731 lung cancer cases and 731
smoking-matched controls in six prospective cohort studies from the LC3
consortium.
Most study participants were men (980 men vs. 482 women) and the mean
age at blood collection was 65 years (standard deviation 9 years). The
mean time between pre-diagnostic blood collection and diagnosis was 1.6
years (range: 0–3 years, by design) (Table [30]1). Demographic
characteristics stratified by cohort are presented in Supplementary
Data [31]1.
Table 1.
Characteristics of 731 lung cancer cases and 731 matched controls from
the Lung Cancer Cohort Consortium included in analyses to identify
protein biomarkers of imminent lung cancer diagnosis
Characteristic                          Cases                Controls
                                        N (%) or mean (SD)   N (%) or mean (SD)
Total number of participants            731                  731
Female                                  241 (33%)            241 (33%)
Age, years                              64.8 (9.1)           64.7 (9.2)
Body mass index, kg/m^2                 25.5 (4.2)           26.2 (4.3)
Follow-up time, years^a                 1.6 (0.9)            11.9 (5.4)
Follow-up survival time, years^b        4.1 (4.1)            13.0 (5.4)
Smoking status
  Current                               397 (54%)            400 (55%)
  Former                                334 (46%)            331 (45%)
Cigarettes smoked per day               20.9 (13.3)          16.3 (11.7)
Years smoked                            39.5 (12.2)          36.2 (14.0)
Years since cessation, among former     15.4 (11.7)          19.0 (13.6)
Histology
  Adenocarcinoma                        246 (34%)
  Squamous cell carcinoma               150 (20%)
  Large cell carcinoma                  27 (4%)
  Small cell carcinoma                  118 (16%)
  Other/NOS                             190 (26%)
Stage
  Early stage (TNM 1/2)                 78 (23%)
  Late stage (TNM 3/4)                  256 (77%)
  Unknown/missing                       397
Participating cohort
  CPS-II                                115 (16%)            115 (16%)
  EPIC                                  188 (26%)            188 (26%)
  HUNT                                  164 (22%)            164 (22%)
  MCCS                                  108 (15%)            108 (15%)
  NSHDS                                 64 (9%)              64 (9%)
  SCHS                                  92 (12%)             92 (12%)
^aTime from blood draw to end of follow-up or lung cancer diagnosis.
^bTime between blood draw and the end of follow-up for mortality
(including death).
Age, body mass index, and smoking information were assessed at the time
of blood draw.
EPIC The European Prospective Investigation into Cancer and Nutrition,
NSHDS Northern Sweden Health and Disease Study, HUNT The Trøndelag
Health Study, MCCS The Melbourne Collaborative Cohort Study, SCHS The
Singapore Chinese Health Study, CPS-II The Cancer Prevention Study II.
Identification and description of proteins associated with imminent lung
cancer
We used the Olink Proteomics ([33]https://www.olink.com/) platform to
measure relative concentrations of up to 1162 individual proteins
across 14 panels. We initially measured all available panels in samples
from 252 case-control pairs selected from the European Prospective
Investigation into Cancer and Nutrition (EPIC) study and the Northern
Sweden Health and Disease Study (NSHDS). Subsequently, among 479
additional case-control pairs selected from four additional cohorts, we
re-measured a subset of protein panels (totalling between 392 and 484
proteins), which were chosen to maximize coverage of the proteins with
the strongest risk associations (Supplementary Table [34]1). Controls
were matched to cases by age, date of blood draw, sex, cohort, and
smoking information in four categories (details in Methods section).
Quality control results are provided in Supplementary Data 2[35]a, b
and [36]3. For statistical analyses, we replaced protein measurements
below the lower limit of detection (LOD) with LOD/√2, according to the
manufacturer’s recommendation.
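The substitution rule for values below the LOD can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the example measurements, and the LOD value of 2.0 are hypothetical, and the sketch assumes that measurements and the LOD are expressed on the same scale.

```python
import math

def impute_below_lod(values, lod):
    """Replace every measurement below the limit of detection with LOD / sqrt(2)."""
    substitute = lod / math.sqrt(2)
    return [substitute if value < lod else value for value in values]

# Hypothetical measurements for one protein; lod=2.0 is invented for illustration.
measurements = [0.5, 2.0, 3.7, 1.9]
imputed = impute_below_lod(measurements, lod=2.0)
```

Values at or above the LOD are left unchanged; only values strictly below it are replaced.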
Overall discovery analysis of proteins associated with lung cancer risk
We evaluated the association of each protein with risk of imminent lung
cancer diagnosis using conditional logistic regression models. The
associations between all 1162 proteins and lung cancer risk are
reported in Fig. [37]1 and Supplementary Data [38]4. In the full study
sample, there were 67 proteins associated with lung cancer after
accounting for multiple comparisons using the effective-number-of-tests
method^[39]8 (Supplementary Data [40]4). We subsequently implemented a
resampling procedure to simulate 500 iterations of an independent
discovery-replication design, which was designed to more stringently
identify proteins whose associations with lung cancer had high
reproducibility. As intended, the resampling algorithm identified a
smaller group of 36 proteins (Fig. [41]1, Supplementary Figs. [42]1 and
[43]2, Supplementary Data [44]5). A flow chart depicting this analysis
is presented in Supplementary Fig. [45]1.
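For readers unfamiliar with conditional logistic regression, the following sketch shows how an odds ratio per standard deviation could be estimated in the special case of 1:1 matched sets with a single covariate. In that case the conditional likelihood depends only on the case-minus-control difference within each set. The function name and the example differences are hypothetical, and the sketch ignores any additional covariates the study models may include.

```python
import math

def fit_matched_pairs(diffs, max_iter=50, tol=1e-10):
    """Conditional logistic regression for 1:1 matched sets, one covariate.

    diffs: case-minus-control differences in the standardized protein level.
    Returns (beta, standard_error), fitted by Newton-Raphson on the
    conditional log-likelihood  -sum(log(1 + exp(-beta * d))).
    """
    beta = 0.0
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-beta * d)) for d in diffs]
        score = sum(d * (1.0 - p) for d, p in zip(diffs, probs))
        information = sum(d * d * p * (1.0 - p) for d, p in zip(diffs, probs))
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    # Recompute the information at the final estimate for the standard error.
    probs = [1.0 / (1.0 + math.exp(-beta * d)) for d in diffs]
    information = sum(d * d * p * (1.0 - p) for d, p in zip(diffs, probs))
    return beta, 1.0 / math.sqrt(information)

# Hypothetical within-pair differences (in standard deviation units).
example_diffs = [1.2, 0.4, -0.3, 0.8, -0.6, 0.5, 1.0, -0.2]
beta, se = fit_matched_pairs(example_diffs)
odds_ratio_per_sd = math.exp(beta)
ci_95 = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
```

The estimate is finite only when the differences are not all of the same sign; with complete separation the Newton iterations would diverge.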
Fig. 1. Identification of 36 protein biomarkers associated with risk of
imminent lung cancer diagnosis among 731 cases and 731 matched controls in
the Lung Cancer Cohort Consortium.
The volcano plot depicts the lung cancer odds ratio per standard
deviation increment in relative protein concentrations (log-base-2
transformed) (x axis) and the −log10 p value (y axis). The 36
identified markers of imminent lung cancer are labeled. Markers were
identified through a resampling process that measured the association
of each protein with lung cancer risk in a discovery set and a
replication set (see Methods). The risk markers were required to have
p < 0.05/effective-number-of-tests in the discovery set and p < 0.05 in
the replication set in at least 50% of the resampling iterations.
Source data are provided as a Source Data file.
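The selection rule described in this legend can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the 50/50 split between discovery and replication sets, the value used for the effective number of tests, and the `p_value_fn` callback (which stands in for fitting a model on a subset of matched sets) are all hypothetical.

```python
import random

def replication_rate(set_ids, p_value_fn, n_effective_tests,
                     n_iterations=500, discovery_fraction=0.5, seed=0):
    """Fraction of random discovery/replication splits in which a protein passes.

    p_value_fn(ids) must return the p value for the protein when the model
    is fitted on the matched sets listed in ids (hypothetical callback).
    """
    rng = random.Random(seed)
    ids = list(set_ids)
    n_discovery = int(len(ids) * discovery_fraction)
    passes = 0
    for _ in range(n_iterations):
        rng.shuffle(ids)
        discovery, replication = ids[:n_discovery], ids[n_discovery:]
        passed_discovery = p_value_fn(discovery) < 0.05 / n_effective_tests
        passed_replication = p_value_fn(replication) < 0.05
        if passed_discovery and passed_replication:
            passes += 1
    return passes / n_iterations

def is_reproducible(rate, min_rate=0.5):
    """A protein is retained if it passes in at least 50% of iterations."""
    return rate >= min_rate

# Stub p-value functions standing in for real model fits.
strong_marker = lambda ids: 1e-9
weak_marker = lambda ids: 0.01

strong_rate = replication_rate(range(100), strong_marker, n_effective_tests=500)
weak_rate = replication_rate(range(100), weak_marker, n_effective_tests=500)
```

In the stub example, the weak marker clears the replication threshold but never the stricter discovery threshold, so it is not retained.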
Among the 36 markers identified by the resampling algorithm, all but
one (SCF) were positively associated with lung cancer risk
(Fig. [48]1). Among these, the estimated odds ratio per standard
deviation (OR[sd]) ranged from 1.31 (IGFBP-1, 95% confidence interval
[95% CI]: 1.17–1.47, p = 2 × 10^−6) to 2.43 (CEACAM5, 95% CI:
2.04–2.89, p = 2 × 10^−23) (Supplementary Data [49]4). The SCF protein
was negatively associated with lung cancer (OR = 0.74, 95% CI:
0.66–0.84, p = 1.24 × 10^−6). When added individually to the PLCOm2012
model^[50]9, a well-performing prediction model for smoking-related lung
cancer that uses questionnaire information, the proteins improved
discrimination between future lung cancer cases and controls by between
0.005 (OSM) and 0.082 (CEACAM5) units in the area under the receiver
operating characteristic curve (AUC) (Supplementary Data [51]4). All 36 proteins
showed good quality control measures and had less than 20% of values
below LOD (Supplementary Data [52]2, Supplementary Data [53]3).
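To make the AUC comparison concrete, the sketch below computes the AUC as the probability that a randomly chosen case has a higher risk score than a randomly chosen control, and then takes the difference between two sets of scores. The scores are invented, and the sketch does not reproduce how the study combined each protein with the PLCOm2012 model.

```python
def auc(case_scores, control_scores):
    """AUC as the share of case-control pairs ranked correctly (ties count 0.5)."""
    concordant = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5
    return concordant / (len(case_scores) * len(control_scores))

# Hypothetical risk scores without and with one added protein.
controls = [0.5, 0.3, 0.2]
cases_baseline = [0.8, 0.6, 0.4]
cases_with_protein = [0.9, 0.7, 0.6]

auc_baseline = auc(cases_baseline, controls)
auc_with_protein = auc(cases_with_protein, controls)
auc_gain = auc_with_protein - auc_baseline
```

An AUC gain of 0.082, as reported for CEACAM5, would correspond to `auc_gain` on this scale.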
In a sensitivity analysis, we compared the proteins that would be
identified if we used a single split-sample approach for discovery and
replication instead of our resampling algorithm (details in Methods
section). This showed that there were 29 proteins identified by both
methods, 7 markers identified only by the resampling algorithm, and 10
markers identified only by the single split-sample method
(Supplementary Fig. [54]3). Markers identified only by the resampling
algorithm typically had stronger risk associations in the full dataset
and were more consistently associated with risk across the six cohorts
compared with the proteins identified only by the single split-sample
method (Supplementary Data [55]6).
For the 36 proteins identified by the resampling algorithm as having
replicable associations with risk of imminent lung cancer diagnosis,
the following results describe their epidemiological and gene
expression characteristics, as well as their known relevance in
carcinogenesis.
Analyses considering stage at diagnosis, histological subtype, and lead time
Among cases with complete stage information at diagnosis, 256 of 334
cases were diagnosed at late stage (stage 3–4) (Table [56]1). A
majority of proteins (23 out of 36) showed stronger odds ratios for
late-stage compared with early-stage (stage 1–2) lung cancer, but a
clear difference (p-heterogeneity [p[het]] < 0.05) was only apparent
for two proteins (CXL17 and CEACAM5) (Supplementary Data [57]7,
Supplementary Fig. [58]4). Stage-stratified odds ratio and AUC
estimates are presented in Supplementary Data [59]7. For the subset of
lung cancer cases with available information on stage at diagnosis, we
estimated the stage at blood draw using sojourn times specific to
stage, histological type, and sex previously estimated by ten Haaf et
al.^[60]10. This suggested that 78% of cases were likely early stage
(stage 2 or earlier) at the time of blood draw (Supplementary
Fig. [61]5).
In Supplementary Data [62]8, we present associations between the 36
identified proteins and lung cancer risk by the major histological
subtypes and demographic strata (sex, smoking status, cohort, and
lead time). Most of the markers displayed consistent risk associations
across the major histological subtypes. Exceptions (p[het] < 0.05)
included CEACAM5, which was more strongly associated with
adenocarcinoma than squamous cell carcinoma, and MMP12, which was more
strongly associated with squamous cell carcinoma than with
adenocarcinoma (Supplementary Data [63]8, Supplementary Fig. [64]6).
When stratifying by lead time (time between blood draw and diagnosis),
19 proteins showed heterogeneity in associations (p[het] < 0.05,
Supplementary Data [65]8) and 11 had a clear trend in the strength of
association across categories of lead time (p[trend] < 0.05,
Supplementary Fig. [66]7, Supplementary Data [67]9). For instance,
EN-RAGE displayed little evidence for an association with lung cancer
at 2–3 years prior to diagnosis (OR[2–3y]: 1.10, 95% CI: 0.91–1.33),
but was strongly associated within one year of diagnosis (OR[<1y]:
2.49, 95% CI: 1.87–3.32, p[het] = 6 × 10^−6). A similar pattern was
observed for IL6 (OR[2–3y]: 1.36, 95% CI: 1.10–1.67 vs OR[<1y]: 2.56,
95% CI: 1.92–3.41, p[het] < 0.001).
Analyses considering smoking history and demographic factors
Stratified analysis by smoking status highlighted two proteins, IGFBP-1
and VWA1, that had stronger lung cancer risk associations in current vs
former smokers (p[het] < 0.05, Supplementary Data [68]8, Supplementary
Fig. [69]8). Additionally, accounting for smoking intensity, duration
and years since cessation resulted in very little attenuation of the OR
estimates (Fig. [70]2, Supplementary Data [71]10). When evaluating
cross-sectional relationships between protein concentrations and
smoking history metrics in controls using linear regression adjusted
for sex, age and cohort, we found that many markers had different
concentrations when comparing former and current smokers, but only
GDF-15 was associated with smoking intensity after accounting for
multiple comparisons (Supplementary Fig. [72]9a). We also found that SCF
was inversely associated with smoking duration. When analyzing lung cancer
cases and controls combined (whilst additionally accounting for
case-control status), we found several additional proteins associated
with smoking intensity and duration (Supplementary Fig. [73]9b).
Fig. 2. Lung cancer odds ratios for the 36 proteins associated with imminent
lung cancer diagnosis before and after detailed adjustment for smoking
intensity, duration, and years since cessation.
The 95% confidence intervals are calculated as e^(β ± 1.96 × sd), where
β is the estimate from each conditional logistic regression and sd is
its standard deviation. The numbers of samples used are presented in
Supplementary Data [76]10. Source data are provided as a Source Data
file.
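The interval formula in this legend can be illustrated with the values reported in the text for CEACAM5 (odds ratio 2.43, 95% CI 2.04–2.89). The sketch below recovers β and sd from the published interval and then rebuilds the interval from them. The function names are hypothetical, and the recovered values are approximations based on rounded published figures.

```python
import math

def odds_ratio_ci(beta, sd, z=1.96):
    """Odds ratio and confidence interval: exp(beta), exp(beta -/+ z * sd)."""
    return math.exp(beta), math.exp(beta - z * sd), math.exp(beta + z * sd)

def beta_sd_from_ci(lower, upper, z=1.96):
    """Invert the formula: beta is the log-scale midpoint, sd the half-width / z."""
    beta = (math.log(lower) + math.log(upper)) / 2.0
    sd = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return beta, sd

# Reported interval for CEACAM5 in the text.
beta, sd = beta_sd_from_ci(2.04, 2.89)
odds_ratio, lower, upper = odds_ratio_ci(beta, sd)
```

Because the interval is symmetric on the log scale, the point estimate is the geometric mean of the two limits, which here agrees with the reported odds ratio to within rounding.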
Further risk analyses stratified by demographic factors did not
identify important heterogeneity in associations (Supplementary
Data [77]8). However, in a separate exploratory analysis in the SCHS
cohort, whose participants are of Han-Chinese descent, we found two
proteins, RFNG and S100A4, associated with lung cancer risk
(p < 0.05/effective-number-of-tests), despite showing little evidence
for an association among participants of European, US, or Australian
cohorts (Supplementary Fig. [78]10). The OR[sd] for RFNG in SCHS was
2.65 (95% CI: 1.62–4.33, n case sets: 90) compared with 1.07 (95% CI:
0.93–1.23, n case sets: 455) in the other cohorts (p[het] < 0.001), and
the OR[sd] for S100A4 in SCHS was 2.77 (95% CI: 1.72–4.44, n case sets:
92) compared with 1.03 (95% CI: 0.90–1.18, n case sets: 620) in the
other cohorts (p[het] < 0.001).
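One simple way to see why these stratum-specific estimates differ is a z test on the difference between two log odds ratios, with standard errors recovered from the reported confidence intervals. This is an assumption made for illustration: the test used in the study to compute p[het] is described in its Methods and may differ. The function names are hypothetical.

```python
import math

def log_or_and_se(odds_ratio, lower, upper, z=1.96):
    """Log odds ratio and its standard error, recovered from a 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return math.log(odds_ratio), se

def heterogeneity_z_test(stratum_a, stratum_b):
    """Two-sided z test for a difference between two independent log odds ratios."""
    beta_a, se_a = log_or_and_se(*stratum_a)
    beta_b, se_b = log_or_and_se(*stratum_b)
    z = (beta_a - beta_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Reported estimates for RFNG: (odds ratio, lower limit, upper limit).
rfng_schs = (2.65, 1.62, 4.33)
rfng_other_cohorts = (1.07, 0.93, 1.23)
z_stat, p_het = heterogeneity_z_test(rfng_schs, rfng_other_cohorts)
```

Under this simplified test the RFNG estimates give a z statistic of about 3.5, in line with the reported p[het] < 0.001.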
Relationships between risk proteins and their role in cancer development
To contextualize the biological roles of the identified markers in
cancer development, we assigned the proteins to one or more of the ten
hallmarks of cancer as defined by Hanahan and Weinberg^[79]11,[80]12
based on their description and functions available on GeneCards, the
Human Protein Atlas, Uniprot^[81]13–[82]15, and the pathways in which
they are implicated according to g:Profiler^[83]16. Among the 36
markers, we found that 31 had documented functions within the hallmarks
of cancer (Fig. [84]3a). The most frequently implicated hallmark was
“activating invasion and metastasis”, to which 19 proteins were
assigned, including CEACAM5, MMP12, U-PAR and CDCP1. The second most
frequently implicated hallmark was “proliferative signaling”, to which
17 proteins were assigned. We also found 14 proteins assigned to each
of “angiogenesis” and “tumor-promoting inflammation”. When
using g:Profiler^[85]16 to query the list of genes that code for the
identified proteins, we found that the most enriched pathways were
“extracellular region”, “responses to stimulus” and “regulation of
biological processes” (Supplementary Figs. [86]11 and [87]12,
Supplementary Table [88]2).
Fig. 3. Biological context of the 36 proteins associated with risk of
imminent lung cancer diagnosis.
a Relationship between our 36 proteins and the 10 hallmarks of cancer
described by Hanahan and Weinberg, based on their descriptions and
functions available on GeneCards, the Human Protein Atlas, and Uniprot.
Each hallmark is represented by a different color. b Network analysis
among the 36 proteins. The figure depicts partial correlation networks
(accounting for sex, age, cohort, and all other identified proteins)
and stable protein associations. In lung cancer cases, no stable
connections were found for ANGPT2, CDCP1, CEACAM5, CFHR5, CXCL13,
IGFBP-1, IGFBP-2, IL6, MUC-16, SCF, SFTPA1, and TFPI-2. In controls, no
stable connections were found for ANGPT2, CEACAM5, CFHR5, CXCL13,
CXCL9, IL6, MMP12, MUC-16, SCF, SFTPA1, and SYND1. Source data are
provided as a Source Data file.
To assess relationships between proteins, we first quantified pairwise
correlations between the 36 identified risk proteins using adjusted
Pearson correlation coefficients separately in cases and controls
(Supplementary Fig. [91]13). Most proteins were moderately and
positively correlated, except for SCF which was inversely correlated
with some proteins (as well as with lung cancer risk, see above). These
patterns were similar in cases and controls.
To consider the relationships among all proteins simultaneously, we
implemented sparse graphical network models adjusted for partial
correlations between proteins, separately in cases and controls
(Fig. [92]3b). We found U-PAR to be the most highly connected and
central protein in both the case and control networks (eight
connections among cases and nine among controls, Supplementary
Data [93]11). Although most protein connections were common to controls
and cases, we found evidence for three distinct clusters of proteins
with stable associations observed only among cases. One was centered
around SYND1 [Cluster[1]: U-PAR, IL2-RA, SYND1, HGF, and EN-RAGE], one
around VEGFA [Cluster[2]: VWA1, VEGFA and IFI30], and one around MK and
CXCL9 [Cluster[3]: MMP12, CXCL9, MK, and WFDC2]. The Cluster[1] network
was enriched for markers related to inflammatory response (g:Profiler
P[adjusted] = 7.4 × 10^−3) and Cluster[3] was enriched for proteins
involved in homeobox six-3 transcription factor, defense response, and
immune response (g:Profiler P[adjusted] = 4 × 10^−2, 3 × 10^−2, and
4 × 10^−2, respectively). Notably, several of the proteins most strongly associated
with lung cancer, including CEACAM5, IL6, and SCF, were weakly
correlated with other markers and did not have any stable connections
with other identified risk markers (Fig. [94]3b).
Associations with mortality among individuals with lung cancer
Using Cox proportional hazards models, we evaluated the extent to which
the 36 risk proteins were associated with all-cause mortality following
lung cancer diagnosis using both blood concentrations and tumor gene
expression in TCGA samples. Whilst 20 proteins were nominally
associated (p < 0.05) with all-cause mortality when measured in blood
(Supplementary Fig. [95]14), these associations were weak in comparison
to the association with incident lung cancer risk. Only three proteins
(CEACAM5, CDCP1 and VEGFA) were associated with all-cause mortality
after accounting for multiple comparisons (Supplementary Data [96]12
and [97]13). Of the 20 proteins nominally associated with mortality,
three were also nominally associated with all-cause mortality when
assessed using tumor gene expression (CDCP1, CEACAM5, and U-PAR) in
TCGA.
Gene expression in normal and tumor tissue
We used data from GTEx to assess mRNA expression for the genes coding
for the 36 risk proteins in normal tissue. Relative levels of mRNA
expression in various normal cell types for 35 markers are shown in
Fig. [98]4a (data were not available for TNFRSF6B). Three markers (ALPP,
SFTPA1, and MUC-16) were expressed primarily by lung cell types, while
four others (IL2-RA, CXCL13, TNFSF13B, and EN-RAGE) were expressed
primarily in immune cells. For mRNA expression in tumor cell types from
TCGA, we found that most of the 36 markers were expressed in lung tumor
tissue to some degree, but also in a wide variety of other cancer types
(Fig. [99]4b). The only marker that appeared specifically expressed in
lung cancer tissue was SFTPA1.
Fig. 4. Gene expression of 36 protein biomarkers associated with risk of
imminent lung cancer diagnosis in normal and tumor tissue.
Proteins are listed in order of their relative expression in
non-cancerous lung cells. a mRNA expression in normal tissue (GTEx). b
mRNA expression in tumor tissue (TCGA). Source data are provided as a
Source Data file.
Discussion
The INTEGRAL project is a major initiative aiming to identify
circulating protein biomarkers of imminent—but yet-to-be diagnosed—lung
cancer. Based on blood samples drawn up to 3 years prior to clinical
lung cancer diagnosis, we used a high-throughput proteomics platform to
evaluate the association of up to 1162 circulating proteins with
imminent lung cancer diagnosis in 731 cases and 731 matched controls
from six prospective population cohorts. We identified 36 proteins
associated with risk of imminent lung cancer diagnosis, most of which
have not been previously identified as pre-diagnostic lung cancer
biomarkers.
The last decade has seen major investments in research aiming to
identify early cancer biomarkers. With the advent of early detection by
LDCT screening, a strong focus has been placed on lung cancer. A wide
array of circulating biomarkers have been proposed, including germline
gene variants^[102]17,[103]18, microRNA^[104]19,[105]20, epigenetic
markers^[106]21, autoantibodies^[107]22, protein
markers^[108]23,[109]24, and circulating tumor DNA^[110]25. However,
few have been independently validated, and none are widely used in
screening. In the INTEGRAL project, we decided to focus on circulating
proteins due to their demonstrated ability to improve the
discrimination of smoking-based risk prediction in an independent
validation population^[111]23,[112]24, as well as the prospect of
developing a clinical biomarker test at a reasonable cost and sample
volume requirement.
Our current study analyzed 1162 circulating proteins and found 67
proteins associated with lung cancer risk after accounting for multiple
testing. Following a resampling algorithm to simulate many iterations
of split-sample discovery and replication, we identified 36 proteins
with replicable associations with risk of imminent lung cancer
diagnosis, 35 of which showed positive associations with risk.
Comparing results from the resampling algorithm vs. a single-split
discovery/replication analysis demonstrated that our procedure for
identifying proteins is conservative, thus allowing us to comfortably
conclude that they are associated with risk of imminent lung cancer
across the studied populations. Six of the 36 markers have been
previously reported to be associated with lung cancer in pre-diagnostic
samples, including several well-known tumor markers such as CEACAM5/CEA
and CA-125/MUC-16^[113]24, as well as IL6, CDCP1, CXCL9 and
CXCL13^[114]26–[115]28.
We characterized the epidemiological properties of the identified
proteins and their associations with known risk factors such as smoking.
Despite several proteins being associated with smoking history
cross-sectionally^[116]29,[117]30, we found limited evidence for
heterogeneity in risk associations for most of the 36 markers when
stratifying by smoking status, and little impact of additional
adjustment for smoking characteristics. However, we did find stronger
risk associations for many of the 36 markers when measured in blood
drawn closer to diagnosis. This is expected for markers indicative of
forthcoming disease, as opposed to markers of disease etiology. Among
these proteins, two markers from the S100 family (EN-RAGE and S100A11)
displayed particularly strong associations closer to diagnosis.
Proteins in the S100 family are implicated in tumorigenesis and cancer
progression through different mechanisms of inflammation, cell
differentiation, and cell proliferation^[118]31, and have been proposed
as biomarkers for prognosis of melanoma^[119]32,[120]33. These
observations suggest that the risk associations are likely to reflect a
somatic response to (or the direct action of) a subclinical lung tumor,
rather than differences in tobacco exposure. Together with the risk
discrimination analysis that indicated improvements over the PLCOm2012
model for several individual proteins, they also suggest that the
identified markers provide additional risk information to that of
detailed smoking history. We plan to evaluate the extent to which a
combination of proteins may inform risk discrimination in a separate
study. Of note, some markers did not display stronger risk associations
closer to diagnosis, although we could only analyze trends over a
maximum of 3 years lead time, by design. Future studies should
therefore seek to describe patterns in risk associations for the
identified markers over longer lead times.
A potential role for the identified protein markers in early detection
of lung cancer is supported by our analysis estimating that 78% of
cases with known stage at diagnosis were stage 2 or earlier at the time
of blood draw, and 68% stage 1 or earlier, which suggests that the
markers may be able to detect many lung cancers at a curable stage.
Further, we observed improvements in risk discrimination when the
proteins were individually added to the established PLCOm2012
smoking-based risk prediction model. We find these results encouraging
given the overall aim of the INTEGRAL program to use these markers to
improve short-term lung cancer risk assessment prior to LDCT
screening^[121]7,[122]23,[123]24,[124]34.
When evaluating the known mechanistic roles of the 36 proteins, we
found that they have a wide range of molecular functions and include
multiple growth factors (HGF, MK, IGFBP-1, IGFBP-2, TGF-alpha, VEGFA),
tumor necrosis factor-receptors (TNFRSF6B, TNFRSF13B), and chemokines
and cytokines (CXL17, GDF-15, OSM, SCF). SCF, the only protein that we
found to be negatively associated with lung cancer, is involved in
regulation of cell survival, proliferation and hematopoiesis^[125]35.
The marker most strongly associated with lung cancer in our
study—CEACAM5 (CEA)—had a stronger association for adenocarcinoma than
for squamous cell carcinoma. CEACAM5 is a surface glycoprotein that is
involved in cell adhesion, intracellular signaling, and tumor
progression^[126]36. CEACAM5 is routinely used to monitor recurrence
among colorectal cancer patients^[127]37, and was recently highlighted
as a promising target for antibody-drug conjugate therapy of non-small
cell lung cancer^[128]38.
When mapping the identified markers to the hallmarks of cancer, we
found that the most frequently implicated hallmark was “activating
invasion and metastasis” (19 markers), which was associated with
proteins with known roles in the modulation of extracellular matrix
during metastasis such as MMP12 and U-PAR^[129]39,[130]40. The second
most frequently implicated hallmark was “proliferative signaling”,
which was associated with 17 markers, including growth factors such as
HGF^[131]41, TGF-alpha^[132]42, and IGFBP-2^[133]41. Changes in
proliferative signaling are common in lung tumors, as exemplified by
the impact of deleterious mutations in well-described oncogenes, such
as EGFR and KRAS^[134]43. The third most frequently implicated hallmark
(14 proteins) was “tumor-promoting inflammation”, including markers
such as CXCL9, CXCL13, CXL17, IL6, and IL2-RA. This highlights the
central role for inflammation and the immune system in responding to or
initiating the development of lung tumors^[135]11,[136]44. Inflammation
and metastasis in cancer are closely related^[137]45, as the invasion
of vital organs by a tumor is regulated by matrix metalloproteases
(MMP) and urokinase-type plasminogen activator (uPA), both of which are
regulated by NF-κB (regulator of a large array of genes involved in
different processes of the immune and inflammatory responses)^[138]45.
“Angiogenesis” was also associated with 14 proteins, including ANGPT2,
CASP-8, and CEACAM5, which highlights the close relationship between
invasion and metastasis and angiogenesis^[139]46.
To better understand the relationships between the 36 markers, we
conducted a sparse graphical LASSO-based network analysis and observed
specific associations between 12 proteins among lung cancer cases that
did not appear among controls. These case-specific protein connections
were clustered in three groups and were all broadly implicated in an
extracellular defense response to somatic stress. In contrast,
connections that were specific to controls appeared to be more strongly
associated with a signaling response to cell proliferation. In seeking
to establish a risk prediction model including multiple proteins, we
would anticipate some redundancy in the risk discriminative performance
of connected proteins. An interesting observation was that several of
the proteins most strongly associated with lung cancer, including
CEACAM5, IL6, and SCF, did not have any stable connections with the
identified markers.
To understand why circulating concentrations of the identified proteins
are associated with lung cancer diagnosis, and to assess whether they
are likely to be specific to lung cancer—as opposed to cancer at other
sites—we used publicly available expression data for a range of normal
and tumor tissues. This analysis yielded two notable observations.
First, only three proteins, ALPP, SFTPA1, and MUC-16, were
predominantly expressed in normal lung cells compared to cell types of
other origins. In contrast, several proteins appeared to be primarily
expressed by immune cells, although most were also expressed by other
cell types. The second notable observation was that only one
protein—SFTPA1—was predominantly expressed by lung tumor tissue
compared to other tumor tissues, whereas most proteins were expressed
in a wide range of cancer types. These complementary data suggest that
few of the identified markers are likely to have originated in
yet-to-be diagnosed lung tumor tissue, but rather are present in the
circulation as a somatic response to subclinical cancer.
Associations between the identified markers and all-cause mortality
after lung cancer diagnosis were weak. Three markers (U-PAR, CEACAM5,
and CDCP1) were also weakly associated with all-cause mortality when
measured as mRNA in lung tumor tissue in TCGA. Although these
associations do not appear important, particularly given that stage was
not accounted for, they may be consistent with a role for some of the
identified markers in tumor progression or in an immune or inflammatory
response in lung tissue. For example, CDCP1 was previously associated
with an increased risk of lung cancer in pre-diagnostic blood^[140]28,
is overexpressed in lung cancer tissue^[141]47, and is associated with
metastases and poor prognosis^[142]47–[143]50. High U-PAR expression
has been associated with lower overall survival in patients with
NSCLC^[144]51, and U-PAR is also studied as a therapeutic target in
cancer^[145]52.
The key strength of our study is our large, rich data resource, which
was generated specifically to identify early detection markers of lung
cancer. The study design, with pre-diagnostic samples drawn up to 3
years prior to clinical (not screen-detected) lung cancer diagnosis,
ensured that identified markers were not influenced by the diagnosis
itself or subsequent treatment, as they can be in a retrospective
case-control study of diagnosed cases^[146]53. By drawing samples from multiple
studies, we were able to verify the consistency of associations across
populations from the US, Europe, Southeast Asia, and Australia.
Furthermore, our sample size provided 80% power to identify markers
with an OR[sd] of at least 1.26 after considering multiple testing,
suggesting it is unlikely that we failed to identify any marker among
the 1162 proteins that is of major use for early detection. Future
discovery studies seeking to identify protein markers for early lung
cancer detection may therefore consider using our results as an initial
reference and focus additional investments on measuring non-overlapping
sets of markers.
An important limitation of our study was that information on clinical
stage was lacking for many cases. This limited our ability to
comprehensively evaluate whether the identified markers were primarily
driven by lung cancer diagnosed at late stage. However, based on the
stage information available, we did not observe important differences
between the OR estimates for early vs. late stage lung cancer.
Our controls were sampled directly from the same source population as
cases and were individually matched to cases by detailed smoking
characteristics, age, sex, and date of blood draw. This design protects
against multiple types of bias that frequently affect biomarker
studies. However, our nested case-control design does not readily allow
us to establish absolute risk models, nor to evaluate the utility of
our markers for risk prediction in the general population, because such
metrics are strongly influenced by the highly selected controls. As
described by Robbins et al.,^[147]7 we will address this question in a
large, independent validation phase by analyzing pre-diagnostic blood
samples from a larger sample of 1700 lung cancer cases and 2900
randomly selected cohort representatives including 10 additional
cohorts participating in the Lung Cancer Cohort Consortium.
In future work, we plan to study the dynamics of the identified markers
by evaluating repeat blood samples collected from the same individuals
over time. As the majority of study participants in the cohorts were of
European descent (except for the SCHS cohort which comprises mainly
Han-Chinese participants), an important future aim is to determine
whether any additional markers might be important specifically for
populations of non-European ancestry. In addition, our study focused
explicitly on people with a smoking history, and we consider it
unlikely that the most relevant set of markers for lung cancer among
people who never smoked was identified. Finally, we note that there is
substantial scope for future studies to explore the potential
biological roles of the identified markers in lung cancer development
and progression.
To summarize, after screening 1162 proteins, we identified 36 markers
of imminent lung cancer diagnosis with a wide range of functions and
relevance across the hallmarks of cancer. Forthcoming studies will
address the extent to which these markers can discriminate future lung
cancer cases and their utility for early detection. Our study provides
a potential view of the blood proteome in the years leading up to
diagnosis of smoking-related lung cancer and can serve as a reference
for investigations seeking to identify early protein markers of lung
cancer.
Methods
Ethical approval
The protocol of the Lung Cancer Cohort Consortium (INTEGRAL project)
was approved by the Ethics Committee of the International Agency for
Research on Cancer (Project number 11–13). This study involved only
secondary analysis of existing specimens and data. This research was
performed in accordance with the Declaration of Helsinki.
Study sample
A detailed justification for the study design and description of the
study sample is available in Robbins et al.^[148]7. In brief, we
included six prospective cohorts of diverse geographical origin amongst
cohorts participating in LC3, all of which collected plasma or serum
samples that were processed according to standard protocols and stored
at −80 °C or in liquid nitrogen. These included the European Prospective
Investigation into Cancer and Nutrition (EPIC)^[149]54 from several
countries in Europe, The Northern Swedish Health and Disease Study
(NSHDS)^[150]55 from Sweden, the Trøndelag Health Study (HUNT)^[151]56
from Norway, the American Cancer Society Cancer Prevention Study-II
(CPS-II)^[152]57 from the US, the Melbourne Collaborative Cohort
(MCCS)^[153]58 from Australia, and the Singapore Chinese Health Study
(SCHS)^[154]59 from Singapore (descriptions of each cohort are provided
in Robbins et al.^[155]7). Lung cancer cases were eligible if they
reported a current or former history of daily cigarette smoking at
recruitment and were diagnosed with a histologically confirmed lung
cancer (C34) at most three years after blood draw. Controls were
selected by incidence density sampling and matched 1:1 to cases based
on age at blood draw (±1 year, relaxed to ±3 years for sets without
available controls), date of blood draw (±1 month, relaxed to ±3
months), sex (self-reported), and cohort, as well as smoking status in
four categories (people who formerly smoked and quit <10 or ≥10 years
prior, and people who currently smoked <15 or ≥15 cigarettes per day).
The final study sample included 731 lung cancer cases and 731 matched
controls. All research participants provided written, informed consent,
and the study was approved by the relevant Institutional Review Boards.
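The matching rules above can be written as a simple eligibility check. This is a minimal sketch, not the consortium's sampling code. The names `Participant`, `smoking_category`, and `is_candidate_control` are hypothetical, and the conversion of months to days is an assumption. Relaxing age and date windows together with a single flag is a simplification. The incidence density requirement (that a control be alive and free of lung cancer when the index case was diagnosed) is not modeled here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Participant:
    age_at_draw: float       # age at blood draw, in years
    draw_date: date          # date of blood draw
    sex: str                 # self-reported sex
    cohort: str
    smoking_category: str    # one of the four categories defined below


def smoking_category(currently_smokes: bool, years_since_quit: float = 0.0,
                     cigarettes_per_day: float = 0.0) -> str:
    """Map smoking history to the four matching categories in the text."""
    if currently_smokes:
        return "current_<15" if cigarettes_per_day < 15 else "current_>=15"
    return "former_<10y" if years_since_quit < 10 else "former_>=10y"


def is_candidate_control(case: Participant, ctrl: Participant,
                         relaxed: bool = False) -> bool:
    """Check the matching criteria; `relaxed` widens the age and date windows."""
    max_age_gap = 3.0 if relaxed else 1.0   # years
    max_day_gap = 92 if relaxed else 31     # assumption: 1 month ~ 31 d, 3 months ~ 92 d
    return (abs(case.age_at_draw - ctrl.age_at_draw) <= max_age_gap
            and abs((case.draw_date - ctrl.draw_date).days) <= max_day_gap
            and case.sex == ctrl.sex
            and case.cohort == ctrl.cohort
            and case.smoking_category == ctrl.smoking_category)


# Toy participants (invented for illustration only).
case = Participant(62.0, date(2005, 3, 1), "F", "HUNT",
                   smoking_category(True, cigarettes_per_day=20))
close_ctrl = Participant(62.8, date(2005, 3, 20), "F", "HUNT",
                         smoking_category(True, cigarettes_per_day=15))
older_ctrl = Participant(64.5, date(2005, 3, 20), "F", "HUNT",
                         smoking_category(True, cigarettes_per_day=15))
```

In this toy example, `close_ctrl` satisfies the strict criteria, whereas `older_ctrl` qualifies only under the relaxed windows.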
Proteomic measurements
Circulating blood proteins were measured in plasma or serum using the
Olink platform at Olink Proteomics ([156]https://www.olink.com/) in
Uppsala, Sweden. The Olink platform is based on proximity extension
assays (PEA) that are highly sensitive, avoid cross-reactivity, and
have high reproducibility^[157]60. Relative concentrations of up to
1162 unique proteins, distributed over 14 Olink panels, were measured
by quantitative PCR (qPCR) (Supplementary Table [158]1). Measurements
are expressed as normalized protein expression (NPX) values which are
log-base-2 transformed. Details on quality control metrics and
coefficients of variation are available in the Supplementary Methods
and Supplementary Data [159]2a, b. Due to the high cost of Olink
assays, we initially measured the complete available protein library
only among the EPIC and NSHDS samples (n = 252 case-control pairs), and
then assayed the HUNT, CPS-II, SCHS and MCCS samples (n = 479
case-control pairs) for a subset of promising panels which included
between 392 and 484 proteins (see Robbins et al.^[160]7 and
Supplementary Table [161]1). For proteins measured on multiple panels
within a single cohort (n = 112 proteins with more than one
measurement), we used the measurement with the highest variance and
lowest missingness (see Supplementary Methods). Protein measurements
were standardized within each cohort.
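Because NPX values are on a log-base-2 scale, a difference of one NPX unit corresponds to a doubling of relative concentration. The sketch below illustrates this and one plausible reading of "standardized within each cohort", namely z-scoring with the cohort-specific mean and sample standard deviation. That reading is an assumption, and the function names and toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def fold_change(npx_a: float, npx_b: float) -> float:
    """Ratio of relative concentrations implied by two log2-scale NPX values."""
    return 2.0 ** (npx_a - npx_b)


def standardize_within_cohort(records):
    """Z-score NPX values separately within each cohort.

    `records` is a list of (cohort, npx) tuples; z-scores are returned in
    the same order as the input.
    """
    by_cohort = defaultdict(list)
    for cohort, npx in records:
        by_cohort[cohort].append(npx)
    # Cohort-specific mean and sample standard deviation.
    stats = {c: (mean(v), stdev(v)) for c, v in by_cohort.items()}
    return [(npx - stats[c][0]) / stats[c][1] for c, npx in records]


# Toy data (invented for illustration only).
toy = [("EPIC", 4.0), ("EPIC", 5.0), ("EPIC", 6.0),
       ("HUNT", 9.0), ("HUNT", 10.0), ("HUNT", 11.0)]
z_scores = standardize_within_cohort(toy)
```

After standardization, the two toy cohorts share the same scale even though their raw NPX levels differ, which is what allows an odds ratio per standard deviation to be pooled across cohorts.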
Statistical analyses
The first step of our analysis aimed to identify proteins associated
with imminent lung cancer diagnosis. Instead of using a single
split-sample design, which can be subject to substantial influence from
random chance, we applied a resampling-based algorithm which simulates
a split-sample discovery and replication analysis repeated many times
with many different random splits of the data. Specifically, in each of
500 iterations, we split the data into discovery (70%) and replication
(30%) sets. In each of the 500 discovery and replication sets, we
applied conditional logistic regression to estimate the odds ratio of
lung cancer per standard deviation increment in relative
concentration (log-base-2 transformed) of each protein (OR[sd]). We
applied this algorithm twice: once for the subset of 484 proteins
measured in all six cohorts, and separately for the 678 proteins
measured only in EPIC and NSHDS. In both algorithms, we balanced by
cohort when splitting the data into random discovery (70%) and
replication (30%) sets. In the algorithm including all 6 cohorts, we
also ‘forced’ EPIC and NSHDS into the discovery set in every iteration
since those data were used to choose the panels tested in the remaining
four cohorts (Supplementary Methods, Supplementary Fig. [162]1).
Additional details on how missing protein data were handled during the
resampling algorithm are in the Supplementary Methods.
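One iteration of the resampling procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' R code. For 1:1 matched pairs with a single covariate, the conditional likelihood depends only on the case-minus-control difference d in standardized protein level, with each pair contributing 1 / (1 + exp(−β·d)), and it is maximized here by Newton-Raphson. The function names are hypothetical. Applying the 70% fraction within each non-forced cohort is an assumption, since the exact proportions are described in the Supplementary Methods.

```python
import math
import random


def split_pairs(pairs, forced_discovery=("EPIC", "NSHDS"), frac=0.7, rng=None):
    """Split matched pairs into discovery and replication sets by cohort.

    `pairs` is a list of dicts with keys "cohort" and "diff" (case minus
    control, standardized protein level). Cohorts listed in
    `forced_discovery` go entirely into the discovery set.
    """
    rng = rng or random.Random()
    discovery, replication = [], []
    for cohort in sorted({p["cohort"] for p in pairs}):
        members = [p for p in pairs if p["cohort"] == cohort]
        if cohort in forced_discovery:
            discovery.extend(members)
            continue
        rng.shuffle(members)
        cut = round(frac * len(members))   # assumption: fraction applied per cohort
        discovery.extend(members[:cut])
        replication.extend(members[cut:])
    return discovery, replication


def paired_clogit(diffs, n_iter=50, tol=1e-10):
    """Conditional logistic regression for 1:1 matched pairs, one covariate.

    Returns (beta, standard error, two-sided Wald p value). exp(beta) is the
    odds ratio per unit of the covariate. Not robust to complete separation
    (all differences of the same sign) or to all-zero differences.
    """
    beta = 0.0
    for _ in range(n_iter):
        score = info = 0.0
        for d in diffs:
            p = 1.0 / (1.0 + math.exp(-beta * d))
            score += d * (1.0 - p)           # first derivative of log-likelihood
            info += d * d * p * (1.0 - p)    # observed information
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2.0))
    return beta, se, p_value


# Toy example (invented numbers, not study data).
toy_pairs = [{"cohort": "EPIC", "diff": d} for d in (1.0, -1.0, 1.0, 1.0)]
toy_pairs += [{"cohort": "HUNT", "diff": d}
              for d in (1.0, 1.0, -1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, -1.0)]
discovery, replication = split_pairs(toy_pairs, rng=random.Random(0))
beta, se, p_value = paired_clogit([p["diff"] for p in discovery])
odds_ratio_per_sd = math.exp(beta)
```

In the full procedure this step would be repeated for each protein in each of the 500 random splits, storing the discovery and replication p values from every iteration.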
We considered proteins to show replicable associations with imminent
lung cancer if, in at least 50% of iterations, the p value was below
0.05/effective-number-of-tests (ENT)^[163]8 in the discovery set
and below 0.05 in the corresponding replication set. The ENT method
accounts for multiple testing by applying a Bonferroni correction, but
determines the number of independent tests as the number of principal
components needed to explain 95% of the variance in protein
abundance^[164]8.
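The replication rule can be expressed compactly once the per-iteration p values are available. The sketch below assumes the ENT has already been computed (for example from a principal component analysis of protein abundance) and is passed in as a number. The function name and the toy values, including the ENT of 100, are hypothetical.

```python
def is_replicable(p_discovery, p_replication, ent, alpha=0.05, min_fraction=0.5):
    """Apply the resampling-based replication rule to one protein.

    An iteration counts as a hit when the discovery p value is below
    alpha / ent and the replication p value is below alpha. The protein is
    flagged when at least `min_fraction` of iterations are hits.
    """
    if len(p_discovery) != len(p_replication) or not p_discovery:
        raise ValueError("need equal-length, non-empty p value lists")
    threshold = alpha / ent   # Bonferroni-type threshold using the ENT
    hits = sum(1 for p_d, p_r in zip(p_discovery, p_replication)
               if p_d < threshold and p_r < alpha)
    return hits / len(p_discovery) >= min_fraction


# Toy p values for four iterations (invented; the study used 500 iterations).
toy_discovery = [1e-5, 1e-4, 1e-3, 1e-6]
toy_replication = [0.01, 0.2, 0.001, 0.03]
flagged = is_replicable(toy_discovery, toy_replication, ent=100)
```

With these toy values the discovery threshold is 0.0005, two of the four iterations pass both conditions, and the protein is flagged because exactly 50% of iterations are hits.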
As a sensitivity analysis, we assessed the difference between the
results of our resampling approach and a standard, single split-sample
design. Here, we included only EPIC and NSHDS in the discovery set,
since these data were used to choose the panels measured in the other
four cohorts, which were defined as the replication set. We identified
proteins that had a false-discovery-rate (FDR)-adjusted p value below
0.05 in the discovery set and a p value below 0.05 in the replication
set. We chose the less conservative FDR significance instead of ENT
significance because the power in the discovery set for the single
split-sample analysis was lower than in the resampling algorithm due to
smaller sample size.
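The single split-sample rule can be sketched in the same way. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used below is an assumption, and the function names and toy p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def split_sample_hits(p_discovery, p_replication, alpha=0.05):
    """Indices of proteins with FDR-adjusted discovery p < alpha and
    unadjusted replication p < alpha."""
    q_discovery = benjamini_hochberg(p_discovery)
    return [i for i, (q, p_r) in enumerate(zip(q_discovery, p_replication))
            if q < alpha and p_r < alpha]


# Toy p values for four proteins (invented for illustration only).
toy_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
toy_hits = split_sample_hits([0.01, 0.04, 0.03, 0.005], [0.2, 0.01, 0.03, 0.001])
```

Here all four toy proteins pass the FDR threshold in the discovery set, but the first fails replication, so only the remaining three are retained.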
For the group of markers identified as associated with imminent lung
cancer by the resampling algorithm, we carried out additional analyses
using the full dataset. For each marker, we calculated odds ratios for
lung cancer stratified by histological type, stage, smoking status,
cohort, and lead time (time between blood draw and diagnosis) and
examined trends by lead time (see Supplementary Methods). These
stratified analyses did not account for multiple comparisons. To
describe the association between each marker and smoking intensity,
duration, and time since cessation, we used linear regression models
fit among controls with adjustment for cohort, age, sex, and smoking
status. A similar analysis was run in the full dataset (among cases and
controls) while additionally adjusting for case status. We also
estimated stage at the time of blood draw for participants with
available information on stage and histology using sojourn times
specific to stage, sex, and histological type previously estimated by
ten Haaf et al.^[165]10.
For the 36 identified proteins we ran pathway enrichment analysis using
g:Profiler^[166]16 to examine the biological processes in which they
are implicated, and we mapped these outcomes using Cytoscape version
3.9.1 with the EnrichmentMap and AutoAnnotate
applications^[167]61–[168]63. We then used the enrichment analysis
results along with information available on GeneCards, the Human
Protein Atlas, and UniProt^[169]13–[170]15 to match each protein’s
function(s) to one or more of the Hallmarks of Cancer described by
Hanahan and Weinberg^[171]11,[172]12 in order to understand their
biological roles within the development of cancer.
We also examined relationships between the identified markers.
Separately among cases and controls, for pairs of proteins, we
calculated Pearson’s correlation coefficients between the residuals of
protein measurements after removing variance due to age, sex, and
smoking status (‘residualized proteins’). To consider the relationships
among all proteins simultaneously, we implemented sparse graphical
network models. These models use a graphical LASSO-based resampling
method on the partial correlations between residualized proteins to
estimate a sparse set of connections among a set of proteins (see
Supplementary Methods)^[173]64.
We subsequently evaluated the association between each identified
marker and overall survival among participants with lung cancer,
separately using circulating blood measurements and tumor gene
expression. For blood measurements, we applied Cox proportional hazards
regression based on the time from lung cancer diagnosis to death from
any cause, with stratification of the baseline hazard by cohort and sex
and adjustment for age at recruitment. Models also included an
interaction between lead time and the protein measurement, so that the
coefficient for the protein is interpretable as its effect at the time
of lung cancer diagnosis. For tumor gene expression, we extracted lung
tumor RNA-seq gene expression for 480 adenocarcinoma and 420 squamous
cell lung cancer patients from The Cancer Genome Atlas (TCGA) (see
Supplementary Methods).
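The role of the interaction term in the blood-based survival model can be made explicit with a short derivation. The notation below is ours and is only a sketch: P is the standardized protein measurement, L the lead time (time between blood draw and diagnosis), A the age at recruitment, and g(i) the cohort-by-sex stratum of participant i. Including a main effect for lead time is an assumption, as the text mentions only the interaction.

```latex
% Stratified Cox model with a protein-by-lead-time interaction
h_i(t) = h_{0,g(i)}(t)\,
  \exp\!\left(\beta_P P_i + \beta_L L_i + \beta_{PL} P_i L_i + \beta_A A_i\right)

% Log hazard ratio per unit increase in P, as a function of lead time
\log \mathrm{HR}(P \mid L) = \beta_P + \beta_{PL} L,
\qquad
\log \mathrm{HR}(P \mid L = 0) = \beta_P
```

Setting L = 0 corresponds to a blood sample drawn at the time of diagnosis, which is why the protein coefficient alone can be read as the effect at diagnosis.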
We finally compared the cell-specific expression of the markers (mRNA
expression) in tissue extracted from cancer-free individuals with
expression in tumor tissue. Expression data were extracted from the
Human Protein Atlas^[174]65 and the Pathology Atlas^[175]66. Details of
these analyses are in the Supplementary Methods.
All statistical tests were two-sided, and all statistical analyses were
performed using R version 4.1.2.
Reporting summary
Further information on research design is available in the [176]Nature
Portfolio Reporting Summary linked to this article.
Acknowledgements