Graphical abstract
Highlights
* We evaluated the causal relationship between proteins and lung
cancer in Europeans
* 39 robust causal proteins were identified, representing potential
drug targets
* We developed a CPRS, a promising approach for lung cancer risk
stratification
__________________________________________________________________
Bioinformatics; Cancer; Omics
Introduction
Lung cancer is the leading cause of cancer death overall,[48]^1
and most lung cancer patients are diagnosed at a late stage, when
curative treatment is rarely possible.[49]^2 Lung cancer is a
multifactorial malignant disease driven by environmental
exposure,[50]^3 genetic factors,[51]^4 and multi-omics
biomarkers.[52]^5 While environmental exposures and genetic
polymorphisms have been widely recognized, it is crucial to explore the
downstream biomarkers along the central dogma to understand this complex
disease.[53]^6^,[54]^7 A variety of biomarkers have been shown to aid
in the early diagnosis of lung cancer,[55]^8^,[56]^9 but there are
still unmet clinical needs and technical challenges. Proteins function
as crucial hubs between genetics and phenotypes, and the complex
protein-protein interactions that mediate signaling pathways and
biological processes are essential to lung cancer
etiology.[57]^10^,[58]^11 Recently, proteomics-based risk models have
shown promise in predicting incident lung cancer.[59]^12^,[60]^13
Naturally occurring genetic variation, either in close physical
proximity to the protein-encoding gene (cis-) or anywhere else in the
genome (trans-), has wide-ranging effects on protein structure and
function, with important implications for complex diseases.[61]^14
However, few studies have investigated the causal proteins in lung
cancer based on large-scale population protein quantitative trait loci
(pQTLs).[62]^15^,[63]^16 Leveraging data from the human blood
proteome in Mendelian randomization (MR) studies facilitates deeper
characterization of circulating proteins causally associated with lung
cancer.[64]^17^,[65]^18 Identifying proteins causally related to
lung cancer could improve our understanding of the genetic architecture
of lung cancer, reveal drug targets, and provide evidence for selecting
high-risk individuals for low-dose computed tomography (LDCT)
screening. Additionally, high-quality proteomic studies, such as the UK
Biobank (UKB) Olink project, will provide additional scientific
opportunities.[66]^19
To systematically identify causal proteins, we investigated large-scale
genome-wide association studies (GWASs) with lung cancer in five
databases,[67]^20 including UKB, the prostate, lung, colorectal, and
ovarian (PLCO) cancer screening trial, the International Lung Cancer
OncoArray Consortium (ILCCO-OncoArray), the transdisciplinary research
in cancer of the lung (TRICL) research team, and FinnGen. Leveraging
pQTL summary statistics from the deCODE Genetics and Fenland cohorts,
we performed meta-analysis on proteome-wide MR (PW-MR) studies and
identified blood proteins causally linked to lung cancer. Finally, we
aimed to develop an effective tool, the causal-protein risk score
(CPRS), to predict lung cancer risk and improve eligibility criteria
for lung cancer screening programs.
Results
Proteome-wide Mendelian randomization studies of lung cancer
Using pQTLs from 35,559 Icelanders in deCODE Genetics and 10,708
Europeans in Fenland, we investigated the causal association of 4,719
and 4,775 plasma proteins with lung cancer, respectively. Genetic
association statistics of lung cancer were obtained from five cohorts
(30,312 LC cases and 652,902 controls from UKB, PLCO, ILCCO-OncoArray,
TRICL, and FinnGen) ([68]Table S1). We performed a meta-analysis of
PW-MR (PW-MR-meta) based on the two pQTLs and lung cancer genetic
associations ([69]Table S2). The workflow of the study design is
illustrated in [70]Figure 1. The PW-MR-meta identified 270 unique
causal plasma proteins significantly associated with lung cancer
(FDR-q < 0.05), 7 of which overlapped with the 351 lung cancer
protein-coding genes in the NHGRI-EBI GWAS Catalog
([71]www.ebi.ac.uk/gwas/) ([72]Figure 2 and [73]Table S3). Among them,
182 proteins were identified in deCODE, and 127 proteins were
identified in Fenland ([74]Figures 3A, 3B, and [75]S1). Moreover, we
found 39 robust causal proteins (FDR-q < 0.05 in both datasets,
[76]Figures 3C and 3D), 78 moderate causal proteins (FDR-q < 0.05 in
one dataset and p < 0.05 in the other), and 153 general causal proteins
(FDR-q < 0.05 in any dataset). Bayesian colocalization showed that 3
causal proteins (PRSS27, TAPBPL, and PSG5) had a shared single causal
signal (colocalization PP.H4 > 0.7) in deCODE. In Fenland, 2 causal
proteins (SERPING1 and GAA) had a shared single causal signal
(colocalization PP.H4 > 0.7), and 2 proteins (TLR3 and MICA) had
moderate support of colocalization (0.5 < PP.H4 < 0.7)
([77]Table S4). We further analyzed the effects of the SNPs for the top
10 most significant robust causal proteins on lung cancer across the
different GWAS datasets, which showed consistent effect directions
([78]Figure S2). In
Mendelian randomization results from the Proteome PheWAS
browser,[79]^21 12 of 39 proteins were available and examined in our
study, and seven were significantly associated with lung cancer or lung
adenocarcinoma or lung squamous cell carcinoma (p < 0.05)
([80]Figure S3 and [81]Table S5).
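The three evidence tiers described above can be written as a simple decision rule. The following is a minimal sketch, not the authors' code; the function name and argument layout are assumptions made for illustration, and a protein missing from one dataset is assumed to be passed as NaN.

```python
def classify_protein(fdr_decode, p_decode, fdr_fenland, p_fenland, alpha=0.05):
    """Assign an evidence tier from per-dataset FDR-q and raw p values.

    robust:   FDR-q < alpha in both datasets
    moderate: FDR-q < alpha in one dataset and p < alpha in the other
    general:  FDR-q < alpha in at least one dataset (no higher tier applies)

    A value of float("nan") for a missing dataset compares as False,
    so such a protein can at most reach the "general" tier.
    """
    sig_decode = fdr_decode < alpha
    sig_fenland = fdr_fenland < alpha
    if sig_decode and sig_fenland:
        return "robust"
    if (sig_decode and p_fenland < alpha) or (sig_fenland and p_decode < alpha):
        return "moderate"
    if sig_decode or sig_fenland:
        return "general"
    return "not significant"
```

Checking the tiers from strictest to loosest keeps them mutually exclusive, which matches the reported counts summing to the total number of causal proteins (39 + 78 + 153 = 270).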
Figure 1. Study workflow
Figure 2. PhenoGram of significant associations from the PW-MR
meta-analysis
The blue dots represent the general causal proteins, the green dots
represent the moderate causal proteins, and the red dots represent the
robust causal proteins. The dots represent PW-MR significance, and the
diamonds represent both PW-MR and GWAS significance.
Figure 3. PW-MR meta-analysis of lung cancer
Manhattan plots of the lung cancer PW-MR meta-analysis using (A) deCODE
and (B) Fenland data (the 20 most significant proteins are labeled). The
red dashed line represents FDR-q < 0.05.
(C) Dot plot of the odds ratios (ORs) and 95% confidence intervals
(CIs) of robust causal proteins (FDR-q < 0.05 in both datasets).
(D) Venn diagram depicting proteins associated with lung cancer in
deCODE only, in Fenland only, or in both.
Multi-omics analysis for the identified proteins
We performed a multi-omics analysis to integrate the identified robust
and moderate causal proteins into transcriptomics and proteomics from
CPTAC. A total of 110 unique protein-coding genes and 81 proteins that
passed quality control were included. The expression levels of 33
protein-coding genes differed between tumor and adjacent normal tissues
(FDR-q < 0.05, FC > 1.5 or FC < 0.5). The well-known interleukin family
and immune-related genes,
CDH3, IL20RB, IL36A, CD27, CD109, and CCL19 were upregulated in tumor
tissues, while IL3RA, IL18R1, and CDH5 were downregulated
([88]Figure 4A and [89]Table S6).
Figure 4. Main results of the multi-omics analysis for the identified
proteins
(A) Volcano plot for the FC values and -log(P) values for comparison of
gene expression.
(B) Volcano plot for the FC values and -log(P) values for comparison of
protein abundance.
(C) KEGG pathway network from the enrichment analysis of the robust and
moderate causal proteins.
(D) Protein-protein interaction network of the robust and moderate
causal proteins.
Differences in abundance between tumor and adjacent normal tissues were
observed for 27 proteins (FDR-q < 0.05, FC > 1.5 or FC < 0.5).
Interestingly, some proteins showed patterns similar to the
corresponding gene expression, such as CCL19, CDH3, CDH5, and IL3RA.
However, opposite trends were found for some proteins. For example,
CD109 was downregulated in tumor tissues ([92]Figure 4B and [93]Table S7).
Further, we performed KEGG pathway enrichment analysis for the robust
and moderate causal proteins. Inflammation- and immune-related pathways
were identified, such as cytokine-cytokine receptor interaction (p =
7.05 × 10^−8) and viral protein interaction with cytokine and cytokine
receptor (p = 3.63 × 10^−5), as well as metabolic and classical
cancer-related pathways, including glycosphingolipid biosynthesis -
globo and isoglobo series (p = 3.95 × 10^−4) and the PI3K-Akt signaling
pathway (p = 1.78 × 10^−2) ([94]Figure 4C and [95]Table S8).
Using the STRING database to integrate protein-protein interactions
among the robust and moderate causal proteins, we identified two main
clusters: the first cluster was related to signal transduction (e.g.,
GZMB, CXCL12, and FASLG) and the immune system (e.g., IL1B, CCL19, and
CD27); the second cluster was related to metabolic pathways (e.g.,
C1GALT1C1 and ST3GAL1) ([96]Figure 4D).
Causal proteins could identify high-risk population
For CPRS and PRS development and validation, a total of 43,395
Europeans with Olink proteomics data in UKB were included in the
prediction study. Among the 117 robust and moderate causal proteins
(FDR-q < 0.05 in one dataset, FDR-q < 0.05 or p < 0.05 in another) in
PW-MR-meta analysis, 41 significant proteins were available in the UKB
individual-level data. Finally, 16 proteins, with consistent direction
of observational effect and causal effect, were selected to construct
CPRS to identify high-risk populations for lung cancer incidence
([97]Tables S9 and [98]S10, and [99]Figure S4). These proteins were
mainly enriched in inflammatory, metabolic, and neurological categories
([100]Figure S5).
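The selection and scoring steps above can be sketched in code. This is a hedged illustration only: the text does not give the scoring formula, so the sketch assumes the CPRS is a weighted sum of standardized protein levels with the MR effect estimate as the weight. The function names, protein labels, and numeric values are hypothetical.

```python
def select_consistent(effects):
    """Keep proteins whose observational and MR effects point the same way.

    `effects` maps protein name -> (observational_beta, mr_beta).
    Returns protein name -> MR beta, used here as the (assumed) weight.
    """
    return {
        name: mr_beta
        for name, (obs_beta, mr_beta) in effects.items()
        if obs_beta * mr_beta > 0  # same sign and neither effect is zero
    }


def causal_protein_risk_score(levels, weights):
    """Weighted sum of one participant's standardized protein levels."""
    return sum(weight * levels[name] for name, weight in weights.items())


# Hypothetical effects (log scale) and standardized levels, for illustration only.
effects = {
    "protein_A": (0.30, 0.20),    # consistent, risk-increasing
    "protein_B": (-0.25, -0.10),  # consistent, protective
    "protein_C": (0.15, -0.05),   # inconsistent direction, dropped
}
weights = select_consistent(effects)
score = causal_protein_risk_score(
    {"protein_A": 1.0, "protein_B": -2.0, "protein_C": 0.5}, weights
)
```

Filtering on sign agreement before scoring means a protein contributes only when the observational data and the MR estimate agree on whether it raises or lowers risk.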
The CPRS significantly stratified the absolute incidence risk of lung
cancer in the overall population (log rank p < 2.20 × 10^−16) and in
ever-smokers (log rank p < 2.20 × 10^−16) in the UKB cohort. Further,
all subjects were categorized into 10 groups by the deciles of CPRS and
PRS, respectively. Compared with the low-risk group (the lowest tenth of
the CPRS), subjects in the high-risk group (the top tenth of the CPRS)
were at significantly higher risk of lung cancer, with a hazard
ratio (HR) of 4.33 (95%CI: 2.65–7.06, p < 4.32 × 10^−9) overall
([101]Figure S6A) and 5.51 (95%CI: 3.24–9.38, p < 3.04 × 10^−10) for
ever-smokers ([102]Figure S6B), which outperformed the PRS [top 10%
versus bottom 10%: HR = 2.59 (95%CI: 1.66–4.05) overall; HR = 2.59
(95%CI: 1.62–4.17) for ever-smokers].
A cumulative effect of the CPRS and PRS was observed for incident lung
cancer in the UKB. Compared with the low-risk population (the lowest
tenth of the CPRS), high-risk persons (the top tenth of the CPRS) had a
hazard ratio (HR) of 4.21 (95% CI: 2.58–6.87) ([103]Figure 5A). Similar
results were observed in the ever-smoking population, in which the HR
of high-risk persons was 5.34 (95% CI: 3.14–9.08) ([104]Figure 5B).
Meanwhile, the HRs of PRS were 2.59 (95% CI: 1.66–4.05) and 2.59 (95%
CI: 1.61–4.17), respectively. Thus, compared with PRS, the CPRS
performed better in lung cancer risk stratification. For CPRS,
participants at low risk had a lower incidence rate of lung cancer
(35.92 per 100,000 person-years) than participants at high risk (232.24
per 100,000 person-years). For PRS, the corresponding rates were 50.98
and 127.08 per 100,000 person-years ([105]Figure 5A). Among
ever-smokers, the lung cancer incidence rate was much higher in the
high-risk CPRS group than in the low-risk group (336.20 vs. 50.18 per
100,000 person-years). Similarly, the top tenth of the PRS had a higher
incidence rate than the lowest tenth (189.15 vs. 75.11 per 100,000
person-years) ([106]Figure 5B).
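The decile grouping and the per-100,000 person-year rates reported above follow standard definitions, which can be sketched as below. This is an illustrative sketch rather than the authors' pipeline; the function names are assumptions, and ties in the score are broken by input order.

```python
def assign_deciles(scores):
    """Return a decile label (1 = lowest tenth, 10 = top tenth) per score."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices by ascending score
    deciles = [0] * n
    for rank, idx in enumerate(order):
        deciles[idx] = rank * 10 // n + 1
    return deciles


def incidence_rate_per_100k(n_cases, person_years):
    """Crude incidence rate: cases per 100,000 person-years of follow-up."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return n_cases * 100_000 / person_years
```

For example, 5 incident cases over 10,000 person-years of follow-up (hypothetical numbers) correspond to a rate of 50 per 100,000 person-years.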
Figure 5. Main results of discrimination evaluation for CPRS and PRS
(A) Cumulative lung cancer incidence plot for CPRS (solid line) and PRS
(dotted line) among all UKB participants with individual-level protein
data.
(B) Cumulative lung cancer incidence plot for CPRS (solid line) and PRS
(dotted line) among smokers in the UKB with individual-level protein
data. The red line indicates high-risk persons, the yellow line
indicates the intermediate-risk population, and the blue line indicates
low-risk persons. Hazard ratios and 95% confidence intervals derived
from Cox regression models adjusted for age, sex, BMI, and smoking
status are provided in the legend.
(C) The C-index values of CPRS and PRS generated by the Cox regression
model.
Further, we evaluated the discrimination abilities of CPRS and PRS
using the C-index and time-dependent AUC for lung cancer incidence. The
C-index of CPRS was 0.656 (95%CI: 0.631–0.681), outperforming the
traditional polygenic risk score (PRS) [0.560 (95%CI: 0.535–0.585)]
(p = 5.38 × 10^−8) ([109]Figure 5C). The time-dependent AUC of CPRS was
65.93 (95%CI: 62.91–68.78, ten-year follow-up), outperforming the PRS
[55.71 (95%CI: 52.67–58.59)] ([110]Figure S7). Thus, the CPRS had
better discrimination power. After adjusting for non-genetic
confounders (age, gender, BMI, and smoking status), the discrimination
of CPRS (C-index[95%CI]: 0.777[0.757–0.797]) was still better than that
of PRS (C-index[95%CI]: 0.758[0.738–0.778]) (p = 5.87 × 10^−7), and
both outperformed the model containing only the non-genetic
predictors (C-index[95%CI]: 0.751[0.731–0.771]) (p = 4.93 × 10^−13, p =
3.95 × 10^−4). The CPRS model was generally better calibrated than the
PRS model ([111]Figure S8). These results suggest that the CPRS has the
ability to predict risk of lung cancer and that it potentially
optimizes the definition of sub-populations at high-risk in
individualized lung cancer prevention.
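For readers unfamiliar with the C-index used above, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is a plain O(n^2) illustration under the usual definition (pairs anchored on an observed event, ties in risk counted as one half, ties in time ignored) and is not the implementation in the R packages listed in the key resources table.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher-risk subject fails first.

    times:       follow-up time per subject
    events:      1/True if the event was observed, 0/False if censored
    risk_scores: higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if times[j] > times[i]:  # subject j was still event-free at times[i]
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the CPRS (0.656) and PRS (0.560) values above are read.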
Discussion
In this study, we systematically evaluated the causal relationships
between plasma proteins and lung cancer. We used GWAS summary
statistics from the UKB, PLCO, ILCCO-OncoArray, TRICL, and FinnGen, and
further performed PW-MR analyses based on two large-scale pQTL
populations. Multi-omics analyses were performed to evaluate the
functional evidence for the identified robust and moderate causal
proteins. Moreover, we constructed a causal protein risk score for lung
cancer based on the PW-MR meta-analysis and further validated it in UKB
Olink proteomics data.
We identified plasma proteins that were causally related to lung cancer
and may represent new therapeutic targets for the prevention or
treatment of lung cancer. Leveraging significant pQTLs in deCODE and
Fenland, 270 proteins were causally associated with lung cancer.
Granzyme B (GZMB), the strongest causal protein, had a protective
effect against lung cancer. GZMB is a serine protease most commonly
found in cytotoxic lymphocytes and natural killer cells, which can
induce Gasdermin E (GSDME)-dependent pyroptosis in tumor targets to
activate anti-tumor immunity both directly by cleaving GSDME and
indirectly by activating caspase 3.[112]^22 Ribonuclease T2 (RNASET2)
is a human RNase T2 enzyme and the only extracellular
nuclease of the RNase T2 family.[113]^23 RNASET2 expression was reduced in
primary ovarian tumors,[114]^24 melanoma,[115]^25 and non-Hodgkin’s
lymphoma.[116]^26 However, increased risk of lung cancer was associated
with increased expression of RNASET2,[117]^27 which is consistent with
our study. Interleukins can nurture an environment enabling and
favoring cancer growth while simultaneously being essential for a
productive tumor-directed immune response.[118]^28 In our study, ten
interleukin proteins were found to be causally associated with lung
cancer, three of which were robust causal proteins. Major
histocompatibility complex (MHC) class I chain-related protein A and B
(MICA/B) stress proteins are upregulated in response to DNA damage in
many types of human cancers but expressed at low or undetectable levels
by healthy cells.[119]^29^,[120]^30^,[121]^31^,[122]^32 MICA/B can
induce tumor immunity of T cells and natural killer (NK) cells and
shows promise as a cancer vaccine target.[123]^33
We observed strong functional evidence for the identified genes from
comparisons of lung cancer tissues and adjacent normal tissues, the
KEGG pathway network, and the protein-protein interaction network. The
signal transduction pathway is related to various body functions and
involved in some important biological processes, including cell
proliferation, differentiation, apoptosis, immune regulation, and
hematopoiesis.[124]^34 The immune system is intrinsic to health. By
broadly assessing human immune system variation and considering
interdependencies between immune system components, we could provide
evidence for cancer prevention or treatment by modulating the immune
system.[125]^35 Metabolic pathways are closely related to tumor
initiation and progression: the tumor microenvironment (TME) can be
depleted of certain nutrients, which forces cancer cells to adapt by
inducing nutrient-scavenging mechanisms to sustain
proliferation.[126]^36
The causal proteins improve the identification of populations at high
risk of lung cancer. It is widely recognized that early screening
for lung cancer is most likely to be beneficial when the target tumor
type has relatively uniform biology and a slower rate of
progression.[127]^37 Targeting high-risk populations with appropriate
strategies for early detection could substantially reduce
mortality.[128]^38 However, selecting the population to be screened is
a complex procedure, and it is difficult to accurately identify the
high-risk persons who are most likely to benefit from screening. Plasma
proteomes provide insight into contributing biological factors, and we
investigated their potential value for future lung cancer prediction.
By evaluating the C-index, time-dependent AUC, and risk stratification,
we demonstrated that the CPRS had better predictive power than the PRS.
It is possible to combine CPRS with lung cancer screening strategies to
improve screening efficiency. People at high risk should be screened
frequently and regularly (e.g., once every three years), which is
expected to further reduce cancer mortality. Therefore, CPRS is
expected to serve as an informative benchmark to incorporate the PRS
and baseline information that have been used in cancer risk assessment.
Our work has several strengths. Firstly, through harmonizing multiple
large-scale GWASs, we comprehensively evaluated causal relationships
between plasma proteins and lung cancer. We identified blood proteins
causally linked with lung cancer through PW-MR-meta. Secondly, we
explored the relationship between identified proteins and lung cancer
at multi-omics levels, including genomics, transcriptomics, and
proteomics, which revealed the identified signals were functional.
Thirdly, we focused on the high-risk population stratification based on
proteins, while few studies developed risk scores using causal
proteins. We demonstrated the stable performance of CPRS across lung
cancer in the UKB Olink proteomics data, especially for its ability to
identify high-risk persons. Therefore, the CPRS might be a
complementary genetic risk assessment tool combined with the existing
screening guidelines.
In conclusion, this large-scale GWAS and PW-MR meta-analysis study of
lung cancer identified plasma proteins causally associated with lung
cancer as well as pathways related to this disease, which may be
further explored as possible therapeutic targets for lung cancer.
Furthermore, this study provides novel insights into population risk
stratification based on CPRS, which can be used as a valuable
supplement to existing lung cancer screening strategies.
Limitations of the study
It is essential to acknowledge the limitations of our study. Firstly,
although the proteins and CPRS weights were determined using the
PW-MR-meta information, the observational proteomics replication was
conducted in a subgroup of UKB Olink proteomics data only. External
proteomics studies should be conducted to validate these findings.
Secondly, we focused on European ancestry only. It is essential to
evaluate the associations of proteins and performance of CPRS in
non-European populations. Thirdly, we mainly investigated the causal
protein effects on population risk stratification. However, the
contribution of environmental factors should not be ignored.
Well-established risk prediction models incorporated with environmental
exposure factors, PRS, and CPRS should be developed for lung cancer.
STAR★Methods
Key resources table
REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data
__________________________________________________________________
deCODE Genetics Zheng et al.[129]^21 [130]https://www.decode.com
Fenland cohort Pietzner et al.[131]^18 [132]https://www.omicscience.org/apps/pgwas/
UK Biobank [133]https://www.ukbiobank.ac.uk/
ILCCO-OncoArray [134]https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001273.v3.p2
TRICL [135]https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001681.v1.p1
FinnGen R6 [136]https://www.finngen.fi/
PLCO [137]https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001286.v2.p2
CPTAC [138]https://pdc.esacinc.com/pdc/pdc
__________________________________________________________________
Software and algorithms
__________________________________________________________________
Minimac (version 4) Minimac4 - Genome Analysis Wiki ([139]umich.edu)
SAIGE (v1.1.6) Zhou et al.[140]^39 [141]https://github.com/weizhouUMICH/SAIGE/
TwoSampleMR R package Hemani et al.[142]^40 [143]https://www.mrbase.org
coloc R package Giambartolomei et al.[144]^41 [145]https://github.com/chr1swallace/coloc
clusterProfiler R package Yu et al.[146]^42 [147]https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html
STRING [148]https://cn.string-db.org/
PLINK (v 1.9) [149]https://www.cog-genomics.org/plink/
timeROC R package Github [150]https://github.com/cran/timeROC
survcomp R package Github [151]https://github.com/bhklab/survcomp
Resource availability
Lead contact
Further information and requests for resources and reagents should be
directed to and will be fulfilled by the lead contact, Sipeng Shen
(sshen@njmu.edu.cn).
Materials availability
This study did not generate new unique reagents.
Data and code availability
* This paper analyzes existing, publicly available data. The access
URLs for the datasets are listed in the [153]key resources table.
* This paper does not report original code.
* Any additional information required to reanalyze the data reported
in this paper is available from the [154]lead contact upon request.
Experimental model and study participant details
Our study is computational and does not use experimental models
typical in the life sciences.
Method details
Study population and data collection
UKB
The UK Biobank (UKB) is a population-based prospective cohort of
individuals aged 40–69 years, enrolled between 2006 and 2010.[155]^43
The work described herein was approved by the UK Biobank under
application 92675. All the phenotype data were accessed in March 2022.
Health-related outcomes were ascertained via individual record linkage
to national cancer and mortality registries and hospital in-patient
encounters. Cancer diagnoses were coded by International Classification
of Diseases version 10 (ICD-10) codes. Individuals with at least one
recorded incident diagnosis of a borderline, in situ, or primary
malignant cancer were defined as cases collected from data fields 41270
(Diagnoses - ICD10), 41202 (Diagnoses - main ICD10), 40006 (Type of
cancer: ICD10), and 40001 (primary cause of death: ICD10). The data
analyses were performed on DNAnexus Research Analysis Platform (RAP).
To minimize the possibility of including lung cancer metastasis, we
excluded lung cancers that occurred within 5 years of a different
primary cancer. In addition, prevalent lung cancer cases diagnosed prior to
baseline enrollment were excluded. Finally, we analyzed 338,726
participants of European ancestry with 4,083 primary lung cancer cases
and 334,643 cancer-free controls.
PLCO
The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening
Trial is a large population-based randomized trial designed and
sponsored by the National Cancer Institute (NCI) to determine the
effects of screening on cancer-related mortality and secondary
endpoints in over 150,000 men and women aged 55 to 74.[156]^44
Participants have been under follow-up for cancer incidence and
mortality since the completion of screening procedure in 2006. In
addition, PLCO included a large biological sample biorepository which
has served as a unique resource for cancer research, particularly for
etiologic and early-marker studies.
Lung cancer diagnoses were coded by the International Classification of
Diseases for Oncology, version 2 (ICD-O-2). Only primary invasive lung
cancers diagnosed during the trial were included. Finally, we analyzed
98,651 participants of European ancestry with 2,455 primary lung cancer
cases and 96,196 controls.
ILCCO-OncoArray
The OncoArray Consortium is a network created to increase understanding
of the genetic architecture of common cancers. The OncoArray GWAS was
originally designed to profile genotype information of 57,775
participants, obtained from 29 studies across North America, Europe,
and Asia.[157]^27 All participants signed the informed consent, and the
studies were approved by the local internal review boards or ethics
committees and administered by trained personnel.
Tumors from patients were classified as adenocarcinomas, squamous
carcinomas, large-cell carcinomas, mixed adenosquamous carcinomas and
other NSCLC histologies following either the International
Classification of Diseases for Oncology (ICD-O) or World Health
Organisation coding.
FinnGen
FinnGen (FG) is a public-private partnership project combining
electronic health record and registry data from six regional and three
Finnish biobanks.[158]^45^,[159]^46 Participant data include genomics
and health records linked to disease endpoints. FinnGen participants
provided informed consent for biobank research. The FinnGen study is
approved by the Finnish Institute for Health and Welfare.
We used summary-level data from FG participants with completed genetic
measurements and imputation. Association results for lung cancer and
cancer-free controls were downloaded (R6 data release).
deCODE genetics
The deCODE genetics database contains extensive genotype and phenotype
information. Summary data for 4,719 proteins measured in 35,559
Icelanders were collected.[160]^47
Fenland cohort
The Fenland study is a population-based cohort of 12,435 participants
of Caucasian-ancestry born between 1950 and 1975 who underwent detailed
phenotyping at the baseline visit from 2005 to 2015. Summary data for
4,775 proteins measured in 10,708 participants were collected.[161]^18
UKB Olink proteomics
Olink proteomics data were generated in the UKB Pharma Proteomics
Project.[162]^19 We analyzed the initial batch of data, which was
generated using the Olink Explore 1536 platform (1,463 proteins) on
43,395 participants of European descent. Linkages to national disease
and death registries were used to identify incident lung cancers
according to ICD-10 code C34. Participants diagnosed with lung cancer
prior to recruitment were excluded. Additionally, participants without
Olink proteomics data from recruitment were excluded from all
prediction model construction. Finally, 490 newly diagnosed lung cancer
cases were included.
CPTAC
The National Cancer Institute’s Clinical Proteomic Tumor Analysis
Consortium (CPTAC) is a national effort to accelerate the understanding
of the molecular basis of cancer through the application of large-scale
proteome and genome analysis.[163]^48 Fragments Per Kilobase per
Million mapped reads (FPKM) values from lung tissue RNA sequencing data
were logarithmically transformed. The Tandem Mass Tag (TMT) labeled
global proteome was analyzed using ThermoFisher mass spectrometer
(Thermo Scientific).[164]^49 After filtering, 195 pairs of tumor and
adjacent normal tissues with gene expression data and 200 pairs with
protein abundance data were analyzed.
Quality control for the SNP array data
The genotype data of UKB were imputed into the dataset using
computationally efficient methods combined with the Haplotype Reference
Consortium (HRC) and UK10K haplotype resource. Details of the genotype
data were described in data-fields 22418 and 22828.
The genotype data of PLCO were generated from the Illumina GSA (673,132
markers), Oncoarray (474,276 markers), and historical data including
Illumina OmniExpress (OmniX) (715,823 markers), Omni2.5M (Omni25)
(2,310,570 markers) and Human Quad 610 (580,912 markers) SNP arrays.
The duplicated samples between different platforms were removed.
Genotyping of 533,631 SNPs in ILCCO-OncoArray was completed at the
Center for Inherited Disease Research, the Beijing Genome Institute,
the Helmholtz Zentrum München, Copenhagen University Hospital and the
University of Cambridge in Illumina Infinium OncoArray
platform.[165]^50 Before standard quality control, we removed the
intentionally duplicated samples and samples from unrelated OncoArray
studies and HapMap control individuals of European, African, Chinese
and Japanese origins.
The genotype data of TRICL were generated from the Affymetrix Axiom
Array containing 414,504 markers, which was a custom panel of key LC
GWAS markers, and rare coding SNVs and indels.
We further excluded participants who lacked disease status, were
second-degree or closer relatives (identity by descent [IBD] > 0.2),
had low-quality DNA (call rate < 95%), showed sex inconsistency, or
were non-European. SNPs were removed if they met any
of the following criteria: (1) sex chromosome, (2) minor allele
frequency (MAF) < 0.05, (3) call rate < 95%, and (4) Hardy-Weinberg
equilibrium (HWE) test P < 1.00×10^−7 in controls or P < 1.00×10^−12 in
cases.
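The four variant-level exclusion criteria listed above can be expressed as a single predicate. The sketch below is illustrative only; the function name, argument names, and the set of chromosome labels treated as sex chromosomes (including the PLINK-style numeric codes 23 and 24) are assumptions.

```python
# Labels treated as sex chromosomes; the numeric codes are an assumption
# based on the common PLINK convention (23 = X, 24 = Y).
SEX_CHROMOSOMES = {"X", "Y", "23", "24"}


def passes_variant_qc(chrom, maf, call_rate, hwe_p_controls, hwe_p_cases):
    """Return True if a SNP survives the four exclusion criteria."""
    if str(chrom).upper() in SEX_CHROMOSOMES:
        return False  # (1) sex chromosome
    if maf < 0.05:
        return False  # (2) minor allele frequency below 0.05
    if call_rate < 0.95:
        return False  # (3) call rate below 95%
    if hwe_p_controls < 1e-7 or hwe_p_cases < 1e-12:
        return False  # (4) Hardy-Weinberg equilibrium failure
    return True
```

The looser HWE threshold in cases than in controls reflects the criteria as stated in the text; a true disease association can itself produce departure from HWE among cases.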
The summary-level data of FinnGen cohort was based on the chip genotype
data with Illumina and Affymetrix arrays, and then imputed by using the
population-specific SISu v3 imputation reference panel.
Data processing of the deCODE Genetics and Fenland cohort
deCODE Genetics sequenced the whole genome of 49,708 Icelanders to a
median of 32× using Illumina technology.[166]^47 In total, deCODE
Genetics has genotyped 166,281 Icelanders with Illumina SNP chips,
long-range phased and imputed based on the sequenced dataset, of which
35,559 had proteins measured by SomaScan version 4. Aptamers for
non-human proteins and aptamers listed as deprecated by SomaScan as
well as aptamers mapping to multiple genes were excluded, leaving 4,907
aptamers targeting a total of 4,719 unique proteins.
Fenland participants were genotyped using three genotyping arrays:
Illumina Infinium Core Exome 24v1, Affymetrix SNP 5.0, and Affymetrix
UK Biobank Axiom array for 12,435 participants of White British
ancestry, which resulted in > 17 million genotyped or imputed variants
with a minor allele frequency of > 0.1%. Relative protein abundances of
4,775 human protein targets were evaluated by 4,979 aptamers (SomaLogic
V4), which only included human protein targets and SOMAmers deemed
acceptable by SomaLogic and that had a value between 0.25 and 4 for all
scaling factors.[167]^14 The 4,775 proteins were used to perform the
association study.
Imputation based on TOPMed imputation server
To estimate missing genotype information, we performed an imputation on
the TOPMed online imputation server, which phased haplotypes with Eagle
v2.4[168]^51 using TOPMed data (Version r2) as a reference panel that
included 97,256 reference samples and 308,107,085 genetic
variants.[169]^52 The server performed imputations for PLCO,
ILCCO-OncoArray, and TRICL using Minimac (version 4) software.
All genetic variants were lifted to GRCh38 coordinates to maintain
consistency with the UKB WES project. Poorly imputed SNVs with
imputation quality score R^2 < 0.4 and SNVs on sex chromosomes were
excluded from the analyses.
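As a rough illustration of the post-imputation filter described above, the following Python sketch drops variants with R^2 < 0.4 and variants on sex chromosomes. The record layout (`chrom`, `r2`), the chromosome labels, and the function name are assumptions made for this example; they are not the output format of the imputation server.

```python
# Chromosome labels treated as sex chromosomes (assumed naming variants).
SEX_CHROMOSOMES = {"X", "Y", "chrX", "chrY", "23", "24"}

def passes_imputation_qc(variant, min_r2=0.4):
    """Keep autosomal variants whose imputation R^2 is at least min_r2."""
    if str(variant["chrom"]) in SEX_CHROMOSOMES:
        return False
    return variant["r2"] >= min_r2

# Toy records, for illustration only.
variants = [
    {"id": "v1", "chrom": "1", "r2": 0.95},
    {"id": "v2", "chrom": "2", "r2": 0.35},  # poorly imputed
    {"id": "v3", "chrom": "X", "r2": 0.99},  # sex chromosome
]
kept = [v["id"] for v in variants if passes_imputation_qc(v)]
print(kept)
```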
Association analyses of single-variant level
We performed single-variant association tests for common variants (MAF
≥ 0.01) using the Scalable and Accurate Implementation of GEneralized
mixed model (SAIGE, v1.1.6).[170]^39 SAIGE is a toolkit developed for
genome-wide association tests in biobank-level datasets that uses
saddlepoint approximation to handle extreme case-control imbalances of
binary traits and linear mixed models to account for sample
relatedness.
Proteome-wide MR (PW-MR) analyses
We performed proteome-wide Mendelian randomization (PW-MR) based on
cis- and trans-pQTLs from deCODE and Fenland and on GWAS summary data
from UKB, PLCO, ILCCO-OncoArray, TRICL, and FinnGen. For instrumental
variable (IV) selection, we retained SNPs that were genome-wide
significant (both cis-SNPs and trans-SNPs, P < 5.0 × 10^-8) for each
pQTL in our study. To minimize correlated horizontal pleiotropy, we
retained SNPs independent of each other (LD window 10,000 kb, R^2 < 0.1
in 1000G) in pQTLs. To quantify the statistical power of the pQTLs, the
strength of each SNP was evaluated by its F-statistic, where an
F-statistic ≥ 10 indicates sufficient instrument strength.[171]^53
IVs with F-statistics < 10 were considered to have limited
power (potentially causing weak instrument bias[172]^54) and were
removed from the MR.
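The instrument-strength check can be sketched as follows. The text does not state which F-statistic formula was used, so this example assumes the common per-SNP approximation F = (beta/se)^2; the SNP names and effect sizes are toy values. Under this approximation a genome-wide significant SNP (P < 5.0 × 10^-8) will normally have F of roughly 30 or more, so the threshold mainly acts as a safeguard.

```python
def f_statistic(beta, se):
    """Approximate per-SNP F-statistic from the pQTL effect and its standard error."""
    return (beta / se) ** 2

def strong_instruments(snps, min_f=10.0):
    """Return the SNPs whose approximate F-statistic is at least min_f."""
    return [s for s in snps if f_statistic(s["beta"], s["se"]) >= min_f]

# Toy pQTL summary statistics (hypothetical SNPs, not from the study).
pqtl_snps = [
    {"snp": "snp_a", "beta": 0.50, "se": 0.05},  # strong instrument
    {"snp": "snp_b", "beta": 0.12, "se": 0.05},  # weak instrument
]
kept_snps = [s["snp"] for s in strong_instruments(pqtl_snps)]
print(kept_snps)
```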
Bayesian colocalization analysis using COLOC
To further investigate whether the association signals of robust causal
proteins and lung cancer were driven by a shared causal variant, we
performed a Bayesian colocalization analysis of 270 causal proteins and
lung cancer in deCODE and Fenland. Colocalization analysis relies on a
single causal variant assumption, and the posterior probability (PP)
for five hypotheses at each pleiotropic locus is provided: (i) H[0]:
neither trait has a genetic association in the region; (ii) H[1]: only
trait 1 has a genetic association in the region; (iii) H[2]: only trait
2 has a genetic association in the region; (iv) H[3]: both traits are
associated, but with different causal variants; (v) H[4]: both traits
are associated and share a single causal variant. The prior
probabilities were set as p[1]=10^-4, p[2]=10^-4, p[12]=10^-5. For each
robust causal protein, regions were defined as the area within 500 kb
of the selected variants. A posterior probability for a shared causal
variant (PP.H4) > 0.7 was considered strong support for
colocalization. Medium support for colocalization was defined as
0.5 < PP.H4 < 0.7.
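To make the five hypotheses concrete, the sketch below combines per-SNP log approximate Bayes factors (ABFs) for two traits into posterior probabilities, following the standard combination used under the single causal variant assumption. The analysis itself was run with the “coloc” R package; this Python version is only an illustration, the log ABF inputs are assumed to be precomputed toy values, and the function names are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def logdiffexp(a, b):
    """log(exp(a) - exp(b)); returns -inf when a <= b (e.g. one-SNP regions)."""
    if a <= b:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))

def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-SNP log ABFs of two traits."""
    sum1 = logsumexp(labf1)
    sum2 = logsumexp(labf2)
    sum12 = logsumexp([a + b for a, b in zip(labf1, labf2)])
    log_h = {
        "H0": 0.0,
        "H1": math.log(p1) + sum1,
        "H2": math.log(p2) + sum2,
        "H3": math.log(p1) + math.log(p2) + logdiffexp(sum1 + sum2, sum12),
        "H4": math.log(p12) + sum12,
    }
    denom = logsumexp(list(log_h.values()))
    return {h: math.exp(v - denom) for h, v in log_h.items()}

def colocalization_support(pp_h4):
    """Map PP.H4 onto the strong (> 0.7) and medium (0.5 to 0.7) categories."""
    if pp_h4 > 0.7:
        return "strong"
    if pp_h4 > 0.5:
        return "medium"
    return "weak"

# Toy region of three SNPs: the same SNP drives both traits.
shared = coloc_posteriors([10.0, 0.0, 0.0], [10.0, 0.0, 0.0])
# Toy region where different SNPs drive each trait.
distinct = coloc_posteriors([10.0, 0.0, 0.0], [0.0, 10.0, 0.0])
print(colocalization_support(shared["H4"]), colocalization_support(distinct["H4"]))
```

With the priors given in the text, the first toy region favors H[4] and the second favors H[3] over H[4], which is the behavior the hypotheses describe.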
Comparison analyses for gene expression or protein abundance in tumor and
adjacent normal tissues
At the tissue level, we integrated transcriptomic and proteomic
measurements to validate whether MR-identified robust and moderate
causal proteins and their protein-coding genes were observably
associated with lung cancer. Paired gene expression and protein
abundance data for tumor and adjacent normal tissues were collected
from CPTAC. Protein abundances were further grouped by unique gene name
by summing all protein abundances belonging to the same gene name.
Pathway enrichment analysis
Pathway enrichment analysis was performed to identify biological
pathways that were over-represented in the list of protein-coding genes
of MR-identified robust and moderate causal proteins. We collected the
pathway information with gene sets from the
KEGG database, containing a total of 213 pathways. All enrichment
analyses were performed using the R package clusterProfiler.[173]^42
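Over-representation analysis of this kind is typically based on a hypergeometric test. The following Python sketch shows that calculation for a single pathway; it is an illustration under assumed inputs (the gene counts are toy numbers and the function name is hypothetical), not a reproduction of the clusterProfiler workflow.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_pathway, n_query, n_overlap):
    """P(X >= n_overlap) when n_query genes are drawn from n_background genes,
    of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy numbers: 20,000 background genes, a 100-gene pathway,
# and a 50-gene query list with 5 genes in the pathway.
p_value = hypergeom_enrichment_p(20_000, 100, 50, 5)
print(p_value)
```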
Protein-protein interaction analysis
To further understand the protein-protein interactions, we used the
STRING database, which considers both physical interactions and
functional associations.[174]^55 The protein interaction network was
partitioned into clusters, shown in different colors, using Markov
Clustering (MCL).
Development and validation of the risk score based on causal proteins
We developed a causal-protein risk score (CPRS) for population risk
stratification based on lung cancer causal proteins. To allow an
independent validation phase, the PW-MR-meta causal effects were used
as the weights of the selected proteins. The CPRS included robust and
moderate causal proteins (FDR-q < 0.05 in one data set, FDR-q < 0.05 or
P < 0.05 in another) in the PW-MR-meta analysis with a consistent
direction of the observational (Cox proportional hazards model) effect
and the causal effect.
The CPRS was generated as:
$$\mathrm{CPRS} = \sum_{i=1}^{n} \beta_i P_i,$$
where $\beta_i$ denotes the coefficient of the i-th protein $P_i$
calculated by PW-MR-meta,
$\beta_i = (\mathrm{Beta}_{\mathrm{deCODE}} + \mathrm{Beta}_{\mathrm{Fenland}})/2$.
In the validation phase, the protein panel with previously determined
weights was used to generate CPRS in the UKB Olink proteomics data.
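The score construction can be sketched in Python as below: weights are the mean of the deCODE and Fenland causal estimates, and the score is the weighted sum of a participant's protein levels. Protein names, effect sizes, and protein levels are hypothetical placeholders, and any normalization of the Olink measurements is outside the scope of this sketch.

```python
def cprs_weights(beta_decode, beta_fenland):
    """Average the two cohort-specific causal estimates for each protein."""
    return {
        protein: (beta_decode[protein] + beta_fenland[protein]) / 2
        for protein in beta_decode
        if protein in beta_fenland
    }

def cprs(protein_levels, weights):
    """Weighted sum of protein levels using previously determined weights."""
    return sum(w * protein_levels[protein] for protein, w in weights.items())

# Hypothetical causal estimates and one participant's protein levels.
weights = cprs_weights(
    {"PROT_A": 0.2, "PROT_B": -0.1},
    {"PROT_A": 0.4, "PROT_B": -0.3},
)
score = cprs({"PROT_A": 1.0, "PROT_B": 2.0}, weights)
print(weights, score)
```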
We used person-years to describe the absolute lung cancer incidence
risk; follow-up time was defined as the interval from the date of
cohort enrollment to lung cancer diagnosis or the last follow-up,
whichever came first.
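A minimal sketch of the follow-up and incidence calculation is given below. The 365.25-day year and the scaling to 100,000 person-years are assumptions chosen for illustration; the text specifies only the start and end points of follow-up.

```python
from datetime import date

def follow_up_years(enrollment, diagnosis, last_follow_up):
    """Years from enrollment to diagnosis or last follow-up, whichever is first."""
    end = last_follow_up if diagnosis is None else min(diagnosis, last_follow_up)
    return (end - enrollment).days / 365.25

def incidence_per_100k(cases, person_years):
    """Incidence rate expressed per 100,000 person-years."""
    return cases / person_years * 100_000

# Toy participants: one diagnosed during follow-up, one censored.
with_event = follow_up_years(date(2010, 1, 1), date(2012, 1, 1), date(2020, 1, 1))
censored = follow_up_years(date(2010, 1, 1), None, date(2020, 1, 1))
print(with_event, censored, incidence_per_100k(1, with_event + censored))
```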
Polygenic risk score generation
The polygenic risk score (PRS) was constructed as the sum of the number
of minor alleles of the SNPs a participant carries, weighted by their
effect sizes as log-odds ratios. We used PRS-128 (128 SNPs) to generate
PRS[175]^56^,[176]^57 based on UKB Olink proteomics data using PLINK
1.9 ([177]https://www.cog-genomics.org/plink/). These SNPs were
collected from the known susceptibility loci of lung cancer and
conditions related to lung cancer (such as lung function impairment)
previously identified through literature curation and NHGRI-EBI GWAS
Catalog ([178]https://www.ebi.ac.uk/gwas/), and additional loci that
passed the suggestive significance level in GWAS. When variants were
correlated, the variants with the strongest statistical significance
were retained to represent independent loci.
Quantification and statistical analysis
Statistics and software
The genome-wide association testing of 4,719 plasma proteins was
performed using a linear mixed model with 27.2 million imputed variants
as genotypes after adjusting rank-inverse normal transformed levels for
age, sex, and sample age for the deCODE Health study. The
likelihood-ratio test was used to compute all P values.[179]^47
PW-MR analyses were performed using the Wald ratio (proteins with only
one available SNP) or the inverse variance weighted (IVW)
method[180]^58^,[181]^59 (all other proteins) in the TwoSampleMR
package[182]^40 to estimate the causal effect of blood proteins on lung
cancer. The MR Steiger test of causal directionality was performed
using the ‘directionality_test’ function in the TwoSampleMR
package.[183]^40 The presence of pleiotropy was further investigated
using the MR-PRESSO and MR-Egger methods[184]^60 to estimate the
potential effect of pleiotropy.[185]^61 When the MR-Egger intercept P
value was < 0.05, we considered the protein-disease signal to be
influenced by horizontal pleiotropy. We also applied Cochran’s Q test
to assess the potential heterogeneity of MR estimates (P value < 0.05).
Proteins with horizontal pleiotropy and heterogeneity were excluded
from all follow-up analyses. The results of PW-MR were summarized by
meta-analysis
(PW-MR-meta). A 5% FDR correction threshold was applied to correct for
multiple testing. Bayesian colocalization was performed using the
“coloc” R package.[186]^41
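For orientation, the two MR estimators named above can be written out as follows. This is a simplified fixed-effect sketch in Python with toy summary statistics; the analyses themselves used the TwoSampleMR R package, whose implementation may differ in how it estimates standard errors. The Wald ratio standard error uses a first-order approximation that ignores uncertainty in the SNP-protein effect.

```python
import math

def wald_ratio(beta_exp, beta_out, se_out):
    """Single-SNP causal estimate: outcome effect divided by exposure effect."""
    estimate = beta_out / beta_exp
    se = se_out / abs(beta_exp)  # first-order approximation
    return estimate, se

def ivw(beta_exp, beta_out, se_out):
    """Fixed-effect inverse variance weighted estimate with Cochran's Q."""
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_exp, se_out)]
    ratios = [by / bx for bx, by in zip(beta_exp, beta_out)]
    estimate = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    # Cochran's Q: weighted squared deviation of per-SNP ratios from the estimate.
    q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    return estimate, se, q

# Toy SNP-protein and SNP-lung cancer effects (hypothetical values).
single = wald_ratio(0.2, 0.1, 0.05)
multi = ivw([0.2, 0.4], [0.1, 0.2], [0.05, 0.05])
print(single, multi)
```

In this toy case both SNPs imply the same ratio, so Cochran's Q is zero; heterogeneous ratios would inflate Q, which would be compared against a chi-square distribution with (number of SNPs - 1) degrees of freedom.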
Paired t-test (FDR-q < 0.05) and fold change (FC > 1.5 or FC < 0.5)
were used to identify the differential gene expression or protein
abundance.
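The two filters can be sketched as follows, assuming per-gene P values from the paired t-test are already available. The Benjamini-Hochberg procedure is used here as the FDR method, and both criteria are assumed to be required jointly; neither detail is spelled out in the text. The input values are toy numbers.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values

def is_differential(q_value, fold_change, q_max=0.05, fc_up=1.5, fc_down=0.5):
    """Apply the FDR-q and fold-change thresholds together."""
    return q_value < q_max and (fold_change > fc_up or fold_change < fc_down)

# Toy P values and tumor/normal fold changes for three genes.
q_values = benjamini_hochberg([0.01, 0.04, 0.03])
flags = [is_differential(q, fc) for q, fc in zip(q_values, [2.0, 1.1, 0.4])]
print(q_values, flags)
```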
Cox proportional hazards models were used to evaluate the association
of CPRS and PRS with lung cancer risk, adjusting for age, sex, body
mass index (BMI), and smoking status. Hazard ratios (HRs) and 95%
confidence intervals (CIs) were calculated. Participants were divided
into ten equal-sized groups according to the distributions of CPRS and
PRS, respectively, and we compared the HR for each group with that of
the lowest tenth. Individuals within the top 10%, 10%-90%, and the
bottom 10% of CPRS and PRS were considered as populations at high,
intermediate, and low genetic risk, respectively. We also used Cox
regression to calculate and compare the cumulative incidence of lung
cancer across the three CPRS and PRS subgroups. The discrimination
performance of the risk scores was evaluated by Harrell’s C-index and
the time-dependent area under the receiver operating characteristic
curve (AUC) using the R package timeROC. We used 10-fold internal
cross-validation (repeated 500 times) to adjust the AUC for
overfitting. Student’s t-test was performed to analyze the differences
in Harrell’s C-index using the “cindex.comp” function of the R package
survcomp.[187]^62 All data analyses were performed using R software
(version 4.2.3, [188]https://www.r-project.org/).
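As a reminder of what Harrell's C-index measures, the sketch below computes it directly from follow-up times, event indicators, and risk scores. It is a plain Python illustration with toy data; it ignores tied event times and is not the implementation used by the R packages named above.

```python
def harrell_c_index(times, events, scores):
    """Fraction of comparable pairs in which the higher score fails earlier.

    A pair (i, j) is comparable when subject i had the event and
    time i is strictly shorter than time j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: follow-up years, event indicator (1 = lung cancer), risk score.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
c_good = harrell_c_index(times, events, [4.0, 3.0, 2.0, 1.0])
c_bad = harrell_c_index(times, events, [1.0, 2.0, 3.0, 4.0])
print(c_good, c_bad)
```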
Acknowledgments