Highlights

     * We evaluated the causal relationship between plasma proteins and
       lung cancer in European populations
     * 39 robust causal proteins were identified, representing potential
       drug targets
     * We developed a causal protein risk score (CPRS), a promising
       approach for lung cancer risk stratification
     __________________________________________________________________

   Bioinformatics; Cancer; Omics

Introduction

   Lung cancer is the leading cause of cancer death overall,[48]^1 and
   most lung cancer patients are diagnosed at a late stage, when curative
   treatment is rarely possible.[49]^2 Lung cancer is a multifactorial
   malignant disease driven by environmental exposure,[50]^3 genetic
   factors,[51]^4 and multi-omics biomarkers.[52]^5 While environmental
   exposures and genetic polymorphisms have been widely recognized as risk
   factors, it is crucial to explore downstream biomarkers along the
   central dogma to understand this complex disease.[53]^6^,[54]^7 A
   variety of biomarkers have been shown to aid in the early diagnosis of
   lung cancer,[55]^8^,[56]^9 but there are still unmet clinical needs and
   technical challenges. Proteins function as crucial hubs between
   genetics and phenotypes, and the complex protein-protein interactions
   that mediate signaling pathways and biological processes are central to
   lung cancer etiology.[57]^10^,[58]^11 Recently, proteomics-based risk
   models have shown promise in predicting incident lung
   cancer.[59]^12^,[60]^13 Naturally occurring genetic variation, either
   in close physical proximity to the protein-encoding gene (cis-) or
   elsewhere in the genome (trans-), has wide-ranging effects on protein
   structure and function, with important implications for complex
   diseases.[61]^14 However, few studies have investigated the causal
   proteins in lung cancer based on large-scale population protein
   quantitative trait loci (pQTLs).[62]^15^,[63]^16 Leveraging data from
   the human blood proteome in Mendelian randomization (MR) studies
   facilitates deeper characterization of circulating proteins causally
   associated with lung cancer.[64]^17^,[65]^18 Identifying such proteins
   could improve our understanding of the genetic architecture of lung
   cancer, reveal drug targets, and provide evidence for selecting
   high-risk individuals for low-dose computed tomography (LDCT)
   screening. Additionally, high-quality proteomic studies, such as the UK
   Biobank (UKB) Olink project, will provide additional scientific
   opportunities.[66]^19

   To systematically identify causal proteins, we analyzed large-scale
   genome-wide association studies (GWASs) of lung cancer from five
   databases,[67]^20 including UKB, the Prostate, Lung, Colorectal, and
   Ovarian (PLCO) Cancer Screening Trial, the International Lung Cancer
   OncoArray Consortium (ILCCO-OncoArray), the Transdisciplinary Research
   in Cancer of the Lung (TRICL) research team, and FinnGen. Leveraging
   pQTL summary statistics from the deCODE Genetics and Fenland cohorts,
   we performed a meta-analysis of proteome-wide MR (PW-MR) studies and
   identified blood proteins causally linked to lung cancer. Finally, we
   aimed to develop an effective tool, the causal protein risk score
   (CPRS), to predict lung cancer risk and improve eligibility criteria
   for lung cancer screening programs.
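
   As an illustration of the pooling step in a meta-analysis of MR
   estimates, the sketch below combines per-dataset log odds ratios by
   fixed-effect inverse-variance weighting. The choice of a fixed-effect
   model is an assumption (the pooling model is not stated in this
   section), the function name is hypothetical, and the input numbers are
   invented for illustration rather than taken from the study.

```python
import math


def fixed_effect_meta(estimates):
    """Pool (beta, se) pairs by inverse-variance weighting.

    `estimates` holds log odds ratios and their standard errors.
    Returns the pooled beta, its standard error, and a two-sided
    p value from the normal approximation.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p


# Invented example: MR estimates for one protein from two GWAS sources.
pooled_beta, pooled_se, pooled_p = fixed_effect_meta([(0.2, 0.1), (0.4, 0.1)])
pooled_or = math.exp(pooled_beta)
```

   With equal standard errors the pooled estimate is the simple average
   of the two inputs, and its standard error shrinks by a factor of the
   square root of two.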

Results

Proteome-wide Mendelian randomization studies of lung cancer

   Using pQTLs from 35,559 Icelanders in deCODE Genetics and 10,708
   Europeans in Fenland, we investigated the causal associations of 4,719
   and 4,775 plasma proteins, respectively, with lung cancer. Genetic
   association statistics for lung cancer were obtained from five cohorts
   (30,312 LC cases and 652,902 controls from UKB, PLCO, ILCCO-OncoArray,
   TRICL, and FinnGen) ([68]Table S1). We performed a meta-analysis of
   PW-MR (PW-MR-meta) based on the two pQTL datasets and the lung cancer
   genetic associations ([69]Table S2). The workflow of the study design
   is illustrated in [70]Figure 1. The PW-MR-meta identified 270 unique
   causal plasma proteins significantly associated with lung cancer
   (FDR-q < 0.05), 7 of which overlapped with the 351 protein-coding genes
   for lung cancer in the NHGRI-EBI GWAS Catalog
   ([71]www.ebi.ac.uk/gwas/) ([72]Figure 2 and [73]Table S3). Among them,
   182 proteins were identified in deCODE, and 127 proteins were
   identified in Fenland ([74]Figures 3A, 3B, and [75]S1). Moreover, we
   found 39 robust causal proteins (FDR-q < 0.05 in both datasets,
   [76]Figures 3C and 3D), 78 moderate causal proteins (FDR-q < 0.05 in
   one dataset and p < 0.05 in the other), and 153 general causal proteins
   (FDR-q < 0.05 in either dataset but not meeting the criteria above).
   Bayesian colocalization showed that three causal proteins (PRSS27,
   TAPBPL, and PSG5) had a shared single causal signal (colocalization
   PP.H4 > 0.7) in deCODE. In Fenland, two causal proteins (SERPING1 and
   GAA) had a shared single causal signal (colocalization PP.H4 > 0.7),
   and two proteins (TLR3 and MICA) had moderate support for
   colocalization (0.5 < PP.H4 < 0.7) ([77]Table S4). We further analyzed
   the effects of the SNPs for the top 10 most significant robust causal
   proteins on lung cancer across the different GWAS datasets, which
   showed consistent effect directions ([78]Figure S2). Of the 39 robust
   proteins, 12 had Mendelian randomization results available in the
   Proteome PheWAS browser,[79]^21 and seven of these were significantly
   associated with lung cancer, lung adenocarcinoma, or lung squamous cell
   carcinoma (p < 0.05) ([80]Figure S3 and [81]Table S5).
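
   The three evidence tiers can be expressed as a small decision rule. The
   sketch below first converts p values to Benjamini-Hochberg q values and
   then assigns a tier from the results in two datasets. Using the
   Benjamini-Hochberg procedure is an assumption (the FDR method is not
   named in this section), and the function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so q values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def classify_tier(q_a, p_a, q_b, p_b, alpha=0.05):
    """Assign a tier from FDR q and nominal p values in datasets A and B."""
    sig_a = q_a < alpha
    sig_b = q_b < alpha
    if sig_a and sig_b:
        return "robust"
    if (sig_a and p_b < alpha) or (sig_b and p_a < alpha):
        return "moderate"
    if sig_a or sig_b:
        return "general"
    return "not significant"


# Invented p values, for illustration only.
q_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

   Because the tiers are mutually exclusive, their counts add up to the
   total number of significant proteins, which matches the reported
   39 + 78 + 153 = 270.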

Figure 1.

   Study workflow

Figure 2.

   PhenoGram of significant associations from the PW-MR meta-analysis

   The blue dots represent the general causal proteins, the green dots
   represent the moderate causal proteins, and the red dots represent the
   robust causal proteins. The dots represent PW-MR significance, and the
   diamonds represent both PW-MR and GWAS significance.

Figure 3.

   PW-MR meta studies of lung cancer

   Manhattan plot of lung cancer PW-MR meta study using (A) deCODE and (B)
   Fenland data (the 20 most significant proteins are labeled). The red
   dashed line represents FDR-q < 0.05.

   (C) Dot plot of the odds ratios (ORs) and 95% confidence intervals
   (CIs) of robust causal proteins (FDR-q < 0.05 in both datasets).

   (D) Venn diagram depicting proteins associated with lung cancer in
   deCODE only, in Fenland only, or in both.

Multi-omics analysis for the identified proteins

   We performed a multi-omics analysis integrating the identified robust
   and moderate causal proteins with transcriptomic and proteomic data
   from CPTAC. A total of 110 unique protein-coding genes and 81 proteins
   that passed quality control were included. The expression levels of 33
   protein-coding genes differed between tumor and adjacent normal tissues
   (FDR-q < 0.05, FC > 1.5 or FC < 0.5). Well-known interleukin family and
   immune-related genes, such as CDH3, IL20RB, IL36A, CD27, CD109, and
   CCL19, were upregulated in tumor tissues, while IL3RA, IL18R1, and CDH5
   were downregulated ([88]Figure 4A and [89]Table S6).

Figure 4.

   Main results of the multi-omics analysis for the identified proteins

   (A) Volcano plot for the FC values and -log(P) values for comparison of
   gene expression.

   (B) Volcano plot for the FC values and -log(P) values for comparison of
   protein abundance.

   (C) KEGG pathway network from the enrichment analysis of the robust and
   moderate causal proteins.

   (D) Protein-protein interaction network of the robust and moderate
   causal proteins.

   Differences in abundance between tumor and adjacent normal tissues were
   observed for 27 proteins (FDR-q < 0.05, FC > 1.5 or FC < 0.5).
   Interestingly, some proteins showed patterns similar to the
   corresponding gene expression, such as CCL19, CDH3, CDH5, and IL3RA.
   However, opposite trends were found for some proteins. For example,
   CD109 was downregulated in tumor tissues at the protein level
   ([92]Figure 4B and [93]Table S7).
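
   The differential-abundance rule and the comparison between the
   transcript and protein layers can be sketched as follows. The
   thresholds come from the text; the function names, the gene labels, and
   every fold change and q value in the example are invented for
   illustration and are not CPTAC results.

```python
def differential_status(fold_change, fdr_q, q_cut=0.05, up=1.5, down=0.5):
    """Label a tumor-versus-normal comparison as up, down, or unchanged."""
    if fdr_q >= q_cut:
        return "unchanged"
    if fold_change > up:
        return "up"
    if fold_change < down:
        return "down"
    return "unchanged"


def compare_layers(rna, protein):
    """Compare direction of change per gene across the two omics layers.

    Both inputs map a gene name to (fold_change, fdr_q). Only genes
    present in both layers are compared.
    """
    result = {}
    for name in sorted(set(rna) & set(protein)):
        rna_label = differential_status(*rna[name])
        protein_label = differential_status(*protein[name])
        if "unchanged" in (rna_label, protein_label):
            result[name] = "not comparable"
        elif rna_label == protein_label:
            result[name] = "concordant"
        else:
            result[name] = "discordant"
    return result


# Invented inputs with hypothetical gene labels.
rna = {"GENE_A": (2.1, 0.001), "GENE_B": (1.8, 0.01), "GENE_C": (0.4, 0.02)}
protein = {"GENE_A": (1.9, 0.003), "GENE_B": (0.3, 0.004), "GENE_C": (0.9, 0.5)}
labels = compare_layers(rna, protein)
```

   Note that the cutoffs FC > 1.5 and FC < 0.5 are not symmetric on the
   log scale, so the rule is slightly stricter for downregulation than
   for upregulation.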

   Further, we performed KEGG pathway enrichment analysis for the robust
   and moderate causal proteins. Inflammation- and immune-related pathways
   were identified, such as cytokine-cytokine receptor interaction (p =
   7.05 × 10^−8) and viral protein interaction with cytokine and cytokine
   receptor (p = 3.63 × 10^−5), as well as metabolic and classical
   cancer-related pathways, including glycosphingolipid biosynthesis -
   globo and isoglobo series (p = 3.95 × 10^−4) and the PI3K-Akt signaling
   pathway (p = 1.78 × 10^−2) ([94]Figure 4C and [95]Table S8).

   Using the STRING database to map protein-protein interactions among the
   robust and moderate causal proteins, we identified two main clusters:
   the first cluster was related to signal transduction (e.g., GZMB,
   CXCL12, and FASLG) and the immune system (e.g., IL1B, CCL19, and CD27);
   the second cluster was related to metabolic pathways (e.g., C1GALT1C1
   and ST3GAL1) ([96]Figure 4D).

Causal proteins could identify high-risk population

   For development and validation of the CPRS and the polygenic risk score
   (PRS), a total of 43,395 participants of European ancestry with Olink
   proteomics data in UKB were included in the prediction study. Among the
   117 robust and moderate causal proteins (FDR-q < 0.05 in one dataset,
   and FDR-q < 0.05 or p < 0.05 in the other) in the PW-MR-meta analysis,
   41 significant proteins were available in the UKB individual-level
   data. Finally, 16 proteins with consistent directions of observational
   and causal effects were selected to construct the CPRS to identify
   populations at high risk of incident lung cancer ([97]Tables S9 and
   [98]S10, and [99]Figure S4). These proteins were mainly enriched in
   inflammatory, metabolic, and neurological categories
   ([100]Figure S5).
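
   A minimal sketch of the selection and scoring steps is given below. It
   assumes that the score is a weighted sum of standardized protein levels
   with the MR effect estimates as weights; the exact weighting scheme is
   not described in this section, so this is an assumption. The function
   names, protein labels, and all numbers are hypothetical.

```python
def select_consistent(effects):
    """Keep proteins whose MR and observational effects share a sign.

    `effects` maps a protein name to (mr_beta, observational_beta).
    """
    return [name for name, (mr, obs) in effects.items() if mr * obs > 0]


def causal_protein_risk_score(levels, weights):
    """Weighted sum of standardized protein levels for one participant.

    Proteins without a weight are ignored.
    """
    return sum(w * levels[name] for name, w in weights.items())


# Invented effects for three hypothetical proteins.
effects = {"P1": (0.3, 0.2), "P2": (-0.1, 0.4), "P3": (-0.2, -0.5)}
kept = select_consistent(effects)
weights = {name: effects[name][0] for name in kept}

# Invented standardized levels for one hypothetical participant.
score = causal_protein_risk_score({"P1": 1.0, "P2": 5.0, "P3": -0.5}, weights)
```

   Here P2 is dropped because its observational and MR effects point in
   opposite directions, so its measured level does not contribute to the
   score.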

   The CPRS significantly stratified the absolute risk of incident lung
   cancer in the overall population (log-rank p < 2.20 × 10^−16) and in
   ever-smokers (log-rank p < 2.20 × 10^−16) in the UKB cohort. Further,
   all subjects were categorized into 10 groups by deciles of the CPRS and
   PRS, respectively. Compared with the low-risk group (lowest tenth of
   the CPRS), subjects in the high-risk group (top tenth of the CPRS) were
   at significantly higher risk of lung cancer, with a hazard ratio (HR)
   of 4.33 (95% CI: 2.65–7.06, p < 4.32 × 10^−9) overall
   ([101]Figure S6A) and 5.51 (95% CI: 3.24–9.38, p < 3.04 × 10^−10) in
   ever-smokers ([102]Figure S6B), outperforming the PRS [top 10% versus
   bottom 10%: HR = 2.59 (95% CI: 1.66–4.05) overall; HR = 2.59 (95% CI:
   1.62–4.17) in ever-smokers].

   A cumulative effect of the CPRS and PRS on incident lung cancer was
   observed in UKB. Compared with the low-risk population (lowest tenth of
   the CPRS), high-risk persons (top tenth of the CPRS) had a hazard ratio
   (HR) of 4.21 (95% CI: 2.58–6.87) ([103]Figure 5A). Similar results were
   observed in ever-smokers, among whom the HR for high-risk persons was
   5.34 (95% CI: 3.14–9.08) ([104]Figure 5B). Meanwhile, the corresponding
   HRs for the PRS were 2.59 (95% CI: 1.66–4.05) and 2.59 (95% CI:
   1.61–4.17), respectively. Thus, compared with the PRS, the CPRS
   performed better in lung cancer risk stratification. For the CPRS,
   participants at low risk had a lower lung cancer incidence rate (35.92
   per 100,000 person-years) than participants at high risk (232.24 per
   100,000 person-years). For the PRS, the corresponding rates were 50.98
   and 127.08 per 100,000 person-years ([105]Figure 5A). Among
   ever-smokers, the incidence rate was much higher in the high-risk CPRS
   group than in the low-risk group (336.20 vs. 50.18 per 100,000
   person-years). Similarly, the top tenth of the PRS had a higher
   incidence rate than the bottom tenth (189.15 vs. 75.11 per 100,000
   person-years) ([106]Figure 5B).
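
   The decile grouping and the incidence rates per 100,000 person-years
   follow from simple arithmetic, sketched below. The function names are
   hypothetical, and rank-based decile assignment is one reasonable
   convention rather than the documented procedure. The case count and
   person-years in the first example are invented; the two rates passed to
   the ratio are the overall CPRS rates reported above.

```python
def incidence_rate_per_100k(cases, person_years):
    """Crude incidence rate per 100,000 person-years."""
    return 1e5 * cases / person_years


def decile_groups(scores):
    """Assign each score to a decile (1 = lowest tenth, 10 = top tenth)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    groups = [0] * n
    for rank, i in enumerate(order):
        groups[i] = rank * 10 // n + 1
    return groups


def rate_ratio(rate_high, rate_low):
    """Crude ratio of two incidence rates."""
    return rate_high / rate_low


example_rate = incidence_rate_per_100k(10, 50000)   # invented counts
example_groups = decile_groups(list(range(20)))     # invented scores
crude_ratio = rate_ratio(232.24, 35.92)             # reported CPRS rates
```

   The crude ratio of the reported CPRS rates is about 6.5. It is not
   directly comparable with the hazard ratios above, which come from Cox
   models adjusted for covariates.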

Figure 5.

   Main results of discrimination evaluation for CPRS and PRS

   (A) Cumulative lung cancer incidence plot for CPRS (solid line) and PRS
   (dotted line) among all UKB participants with individual-level protein
   data.

   (B) Cumulative lung cancer incidence plot for CPRS (solid line) and PRS
   (dotted line) among ever-smokers with UKB individual-level protein
   data. The red line indicates high-risk persons, the yellow line
   indicates the intermediate-risk population, and the blue line indicates
   low-risk persons. Hazard ratios and 95% confidence intervals derived
   from the Cox regression model adjusted for age, sex, BMI, and smoking
   status are provided in the legend.

   (C) The C-index values of CPRS and PRS generated by the Cox regression
   model.

   Further, we evaluated the discrimination of the CPRS and PRS using the
   C-index and time-dependent AUC for lung cancer incidence. The C-index
   of the CPRS was 0.656 (95% CI: 0.631–0.681), outperforming the
   traditional polygenic risk score (PRS) [0.560 (95% CI: 0.535–0.585)]
   (p = 5.38 × 10^−8) ([109]Figure 5C). The time-dependent AUC of the CPRS
   at ten years of follow-up was 65.93% (95% CI: 62.91%–68.78%),
   outperforming the PRS [55.71% (95% CI: 52.67%–58.59%)]
   ([110]Figure S7). The CPRS thus had better discriminative power. After
   adjusting for non-genetic covariates (age, sex, BMI, and smoking
   status), the discrimination of the CPRS (C-index [95% CI]: 0.777
   [0.757–0.797]) was still better than that of the PRS (C-index [95% CI]:
   0.758 [0.738–0.778]) (p = 5.87 × 10^−7), and both outperformed the
   model containing only the non-genetic predictors (C-index [95% CI]:
   0.751 [0.731–0.771]) (p = 4.93 × 10^−13 and p = 3.95 × 10^−4,
   respectively). The CPRS model was generally better calibrated than the
   PRS model ([111]Figure S8). These results suggest that the CPRS can
   predict the risk of lung cancer and may help refine the definition of
   high-risk sub-populations for individualized lung cancer prevention.
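
   For readers unfamiliar with the C-index, the sketch below computes
   Harrell's concordance index from follow-up times, event indicators, and
   risk scores. It is a plain illustration of the definition, not the
   routine used in the study (the key resources table lists an R package
   for this purpose), and the toy data are invented.

```python
def harrell_c_index(times, events, scores):
    """Harrell's C: share of comparable pairs ranked correctly by score.

    A pair (i, j) is comparable when subject i had the event and
    subject j was followed for longer. Tied scores count as half.
    Pairs with tied follow-up times are skipped.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented toy data: higher score should mean earlier event.
toy_times = [1, 2, 3, 4]
toy_events = [1, 1, 0, 1]
c_perfect = harrell_c_index(toy_times, toy_events, [4, 3, 2, 1])
c_reversed = harrell_c_index(toy_times, toy_events, [1, 2, 3, 4])
c_constant = harrell_c_index(toy_times, toy_events, [1, 1, 1, 1])
```

   A value of 0.5 corresponds to a score with no discriminative ability,
   which gives a reference point for the reported C-index values of 0.656
   for the CPRS and 0.560 for the PRS.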

Discussion

   In this study, we systematically evaluated the causal relationship
   between plasma proteins and lung cancer. We used GWAS summary
   statistics from UKB, PLCO, ILCCO-OncoArray, TRICL, and FinnGen, and
   performed PW-MR analyses based on two large-scale pQTL populations.
   Multi-omics analyses were performed to evaluate the functional evidence
   for the identified robust and moderate causal proteins. Moreover, we
   constructed a causal protein risk score for lung cancer based on the
   PW-MR meta-analysis and validated it in UKB Olink proteomics data.

   We identified plasma proteins that were causally related to lung cancer
   and may represent new therapeutic targets for the prevention or
   treatment of lung cancer. Leveraging significant pQTLs in deCODE and
   Fenland, we found 270 proteins causally associated with lung cancer.
   Granzyme B (GZMB), the strongest causal protein, had a protective
   effect against lung cancer. GZMB is a serine protease most commonly
   found in cytotoxic lymphocytes and natural killer cells, which can
   induce Gasdermin E (GSDME)-dependent pyroptosis in tumor targets to
   activate anti-tumor immunity, both directly by cleaving GSDME and
   indirectly by activating caspase 3.[112]^22 Ribonuclease T2 (RNASET2)
   is a human enzyme and the only extracellular nuclease of the RNase T2
   family.[113]^23 RNASET2 expression was reduced in primary ovarian
   tumors,[114]^24 melanoma,[115]^25 and non-Hodgkin’s lymphoma.[116]^26
   However, increased risk of lung cancer was associated with increased
   expression of RNASET2,[117]^27 which is consistent with our study.
   Interleukins can nurture an environment enabling and favoring cancer
   growth while simultaneously being essential for a productive
   tumor-directed immune response.[118]^28 In our study, ten interleukin
   proteins were found to be causally associated with lung cancer, three
   of which were robust causal proteins. MHC class I chain-related
   proteins A and B (MICA/B) are stress proteins that are upregulated in
   response to DNA damage in many types of human cancers but expressed at
   low or undetectable levels by healthy
   cells.[119]^29^,[120]^30^,[121]^31^,[122]^32 MICA/B can induce tumor
   immunity mediated by T cells and natural killer (NK) cells and shows
   promise as a cancer vaccine target.[123]^33

   We observed strong functional evidence for the identified genes from
   comparisons of lung cancer tissues with adjacent normal tissues, the
   KEGG pathway network, and the protein-protein interaction network.
   Signal transduction pathways are related to various body functions and
   involved in important biological processes, including cell
   proliferation, differentiation, apoptosis, immune regulation, and
   hematopoiesis.[124]^34 The immune system is intrinsic to health. By
   broadly assessing human immune system variation and considering
   interdependencies between immune system components, we could provide
   evidence for cancer prevention or treatment by modulating the immune
   system.[125]^35 Metabolic pathways are closely related to tumor
   initiation and progression. The tumor microenvironment (TME) can be
   depleted of certain nutrients, which forces cancer cells to adapt by
   inducing nutrient-scavenging mechanisms to sustain
   proliferation.[126]^36

   Causal proteins improve the identification of populations at high risk
   of lung cancer. It is widely recognized that early screening for lung
   cancer is most likely beneficial when the target tumor type has
   relatively uniform biology and a slower rate of progression.[127]^37
   Targeting high-risk populations with appropriate strategies for early
   detection could substantially reduce mortality.[128]^38 However,
   selecting the population to be screened is a complex procedure, and it
   is difficult to accurately identify the high-risk persons who are most
   likely to benefit from screening. Plasma proteomes provide insight into
   contributing biological factors, and we investigated their potential
   value for future lung cancer prediction. By evaluating the C-index,
   time-dependent AUC, and risk stratification, we demonstrated that the
   protein-based CPRS had better predictive power than the PRS.

   It is possible to combine the CPRS with lung cancer screening
   strategies to improve screening efficiency. People at high risk should
   be screened frequently and regularly (e.g., once every three years),
   which is expected to further reduce cancer mortality. Therefore, the
   CPRS is expected to serve as an informative complement to the PRS and
   the baseline information already used in cancer risk assessment.

   Our work has several strengths. Firstly, through harmonizing multiple
   large-scale GWASs, we comprehensively evaluated causal relationships
   between plasma proteins and lung cancer. We identified blood proteins
   causally linked with lung cancer through PW-MR-meta. Secondly, we
   explored the relationship between identified proteins and lung cancer
   at multi-omics levels, including genomics, transcriptomics, and
   proteomics, which revealed the identified signals were functional.
   Thirdly, we focused on high-risk population stratification based on
   proteins, whereas few previous studies have developed risk scores using
   causal proteins. We demonstrated the stable performance of the CPRS for
   lung cancer in the UKB Olink proteomics data, especially its ability to
   identify high-risk persons. Therefore, the CPRS might serve as a
   complementary risk assessment tool alongside the existing screening
   guidelines.

   In conclusion, this large-scale GWAS and PW-MR meta-analysis of lung
   cancer identified plasma proteins causally associated with lung cancer
   as well as pathways related to this disease, which may be further
   explored as possible therapeutic targets for lung cancer.
   Furthermore, this study provides novel insights into population risk
   stratification based on CPRS, which can be used as a valuable
   supplement to existing lung cancer screening strategies.

Limitations of the study

   It is essential to acknowledge the limitations of our study. Firstly,
   although the proteins and CPRS weights were determined using the
   PW-MR-meta information, the observational proteomics replication was
   conducted in a subgroup of UKB Olink proteomics data only. External
   proteomics studies should be conducted to validate these findings.
   Secondly, we focused on European ancestry only. It is essential to
   evaluate the associations of proteins and performance of CPRS in
   non-European populations. Thirdly, we mainly investigated the effects
   of causal proteins on population risk stratification. However, the
   contribution of environmental factors should not be ignored.
   Comprehensive risk prediction models incorporating environmental
   exposure factors, the PRS, and the CPRS should be developed for lung
   cancer.

STAR★Methods

Key resources table

   REAGENT or RESOURCE | SOURCE | IDENTIFIER
   Deposited data
     __________________________________________________________________

   deCODE Genetics | Zheng et al.[129]^21 | [130]https://www.decode.com
   Fenland cohort | Pietzner et al.[131]^18 | [132]https://www.omicscience.org/apps/pgwas/
   UK Biobank | | [133]https://www.ukbiobank.ac.uk/
   ILCCO-OncoArray | | [134]https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001273.v3.p2
   TRICL | | [135]https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001681.v1.p1
   FinnGen R6 | | [136]https://www.finngen.fi/
   PLCO | | [137]https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001286.v2.p2
   CPTAC | | [138]https://pdc.esacinc.com/pdc/pdc
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   Minimac (version 4) | | Minimac4 - Genome Analysis Wiki ([139]umich.edu)
   SAIGE (v1.1.6) | Zhou et al.[140]^39 | [141]https://github.com/weizhouUMICH/SAIGE/
   TwoSampleMR R package | Hemani et al.[142]^40 | [143]https://www.mrbase.org
   coloc R package | Giambartolomei et al.[144]^41 | [145]https://github.com/chr1swallace/coloc
   clusterProfiler R package | Yu et al.[146]^42 | [147]https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html
   STRING | | [148]https://cn.string-db.org/
   PLINK (v 1.9) | | [149]https://www.cog-genomics.org/plink/
   timeROC R package | GitHub | [150]https://github.com/cran/timeROC
   survcomp R package | GitHub | [151]https://github.com/bhklab/survcomp

Resource availability

Lead contact

   Further information and requests for resources and reagents should be
   directed to and will be fulfilled by the lead contact, Sipeng Shen
   (sshen@njmu.edu.cn).

Materials availability

   This study did not generate new unique reagents.

Data and code availability

     * This paper analyzes existing, publicly available data. The access
       URLs for the datasets are listed in the [153]key resources table.
     * This paper does not report original code.
     * Any additional information required to reanalyze the data reported
       in this paper is available from the [154]lead contact upon request.

Experimental model and study participant details

   Our study is computational and does not use experimental models
   typical in the life sciences.

Method details

Study population and data collection

UKB

   The UK Biobank (UKB) is a population-based prospective cohort of
   individuals aged 40–69 years, enrolled between 2006 and 2010.[155]^43
   The work described herein was approved by the UK Biobank under
   application 92675. All the phenotype data were accessed in March 2022.
   Health-related outcomes were ascertained via individual record linkage
   to national cancer and mortality registries and hospital in-patient
   encounters. Cancer diagnoses were coded using the International
   Classification of Diseases version 10 (ICD-10). Individuals with at
   least one recorded incident diagnosis of a borderline, in situ, or
   primary malignant cancer were defined as cases, identified from data
   fields 41270 (Diagnoses - ICD10), 41202 (Diagnoses - main ICD10), 40006
   (Type of cancer: ICD10), and 40001 (primary cause of death: ICD10). The
   data analyses were performed on the DNAnexus Research Analysis Platform
   (RAP).

   To minimize the possibility of including lung cancer metastases, we
   excluded lung cancers that occurred within 5 years of a different
   primary cancer. In addition, prevalent lung cancer cases diagnosed
   prior to baseline enrollment were excluded. Finally, we analyzed
   338,726 participants of European ancestry, comprising 4,083 primary
   lung cancer cases and 334,643 cancer-free controls.
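
   The case and control definitions above can be written as a small
   decision function. This is a sketch under stated assumptions: "within 5
   years" is read as an absolute gap between the two diagnosis dates, a
   year is taken as 365.25 days, and anyone with another cancer but no
   lung cancer is excluded because controls are described as cancer-free.
   The function name and all dates are hypothetical.

```python
from datetime import date

FIVE_YEARS_DAYS = 5 * 365.25  # assumed definition of the 5-year window


def classify_participant(baseline, lung_dx=None, other_dx=()):
    """Return 'case', 'control', or 'excluded' for one participant.

    baseline : enrollment date
    lung_dx  : date of first lung cancer diagnosis, or None
    other_dx : dates of diagnoses of other primary cancers
    """
    if lung_dx is None:
        # Controls must be free of any cancer.
        return "control" if not other_dx else "excluded"
    if lung_dx < baseline:
        return "excluded"  # prevalent lung cancer
    for other in other_dx:
        if abs((lung_dx - other).days) < FIVE_YEARS_DAYS:
            return "excluded"  # possible metastasis
    return "case"


enrolled = date(2008, 1, 1)  # invented dates below
status_control = classify_participant(enrolled)
status_case = classify_participant(enrolled, date(2012, 6, 1))
status_prevalent = classify_participant(enrolled, date(2007, 1, 1))
status_close = classify_participant(enrolled, date(2012, 6, 1), [date(2010, 1, 1)])
status_distant = classify_participant(enrolled, date(2012, 6, 1), [date(2000, 1, 1)])
```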

PLCO

   The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening
   Trial is a large population-based randomized trial designed and
   sponsored by the National Cancer Institute (NCI) to determine the
   effects of screening on cancer-related mortality and secondary
   endpoints in over 150,000 men and women aged 55 to 74.[156]^44
   Participants have been under follow-up for cancer incidence and
   mortality since the completion of the screening procedures in 2006. In
   addition, PLCO includes a large biorepository of biological samples,
   which has served as a unique resource for cancer research, particularly
   for etiologic and early-marker studies.

   Lung cancer diagnoses were coded using the International Classification
   of Diseases for Oncology version 2 (ICD-O-2). Only primary invasive
   lung cancers diagnosed during the trial were included. Finally, we
   analyzed 98,651 participants of European ancestry, comprising 2,455
   primary lung cancer cases and 96,196 controls.

ILCCO-OncoArray

   The OncoArray Consortium is a network created to increase understanding
   of the genetic architecture of common cancers. The OncoArray GWAS was
   originally designed to profile genotype information of 57,775
   participants, obtained from 29 studies across North America, Europe,
   and Asia.[157]^27 All participants signed informed consent, and the
   studies were approved by the local institutional review boards or
   ethics committees and administered by trained personnel.

   Tumors from patients were classified as adenocarcinomas, squamous
   carcinomas, large-cell carcinomas, mixed adenosquamous carcinomas and
   other NSCLC histologies following either the International
   Classification of Diseases for Oncology (ICD-O) or World Health
   Organisation coding.

FinnGen

   FinnGen (FG) is a public-private partnership project combining
   electronic health record and registry data from six regional and three
   Finnish biobanks.[158]^45^,[159]^46 Participant data include genomics
   and health records linked to disease endpoints. All FinnGen
   participants provided informed consent for biobank research. The
   FinnGen study was approved by the Finnish Institute for Health and
   Welfare.

   We used summary-level data from FG participants with completed genetic
   measurements and imputation. Association results comparing lung cancer
   cases with cancer-free controls were downloaded (R6 data release).

deCODE genetics

   The deCODE genetics database contains extensive genotype and phenotype
   information. We collected summary data for 4,719 proteins measured in
   35,559 Icelanders.[160]^47

Fenland cohort

   The Fenland study is a population-based cohort of 12,435 participants
   of European ancestry born between 1950 and 1975 who underwent detailed
   phenotyping at the baseline visit from 2005 to 2015. We collected
   summary data for 4,775 proteins measured in 10,708 of these
   participants.[161]^18

UKB Olink proteomics

   Olink proteomics data were generated in the UKB Pharma Proteomics
   Project.[162]^19 We analyzed the initial batch of data, which was
   generated using the Olink Explore 1536 platform (1,463 proteins) on
   43,395 participants of European descent. Linkages to national disease
   and death registries were used to identify incident lung cancers
   according to ICD-10 code C34. Participants diagnosed with lung cancer
   prior to recruitment were excluded. Additionally, participants without
   Olink proteomics data from recruitment were excluded from all
   prediction model construction. Finally, 490 newly diagnosed lung cancer
   cases were included.

CPTAC

   The National Cancer Institute’s Clinical Proteomic Tumor Analysis
   Consortium (CPTAC) is a national effort to accelerate the understanding
   of the molecular basis of cancer through the application of large-scale
   proteome and genome analysis.[163]^48 Fragments Per Kilobase per
   Million mapped reads (FPKM) values from lung tissue RNA sequencing data
   were logarithmically transformed. The Tandem Mass Tag (TMT)-labeled
   global proteome was analyzed using a Thermo Fisher mass spectrometer
   (Thermo Scientific).[164]^49 After filtering, gene expression data from
   195 pairs and protein abundance data from 200 pairs of participants’
   tumor and adjacent normal tissues were analyzed.

Quality control for the SNP array data

   The UKB genotype data were imputed using computationally efficient
   methods with the Haplotype Reference Consortium (HRC) and UK10K
   haplotype resources as reference. Details of the genotype data are
   described in data-fields 22418 and 22828.

   The genotype data of PLCO were generated from the Illumina GSA (673,132
   markers), Oncoarray (474,276 markers), and historical data including
   Illumina OmniExpress (OmniX) (715,823 markers), Omni2.5M (Omni25)
   (2,310,570 markers) and Human Quad 610 (580,912 markers) SNP arrays.
   The duplicated samples between different platforms were removed.

   Genotyping of 533,631 SNPs in ILCCO-OncoArray was completed at the
   Center for Inherited Disease Research, the Beijing Genome Institute,
   the Helmholtz Zentrum München, Copenhagen University Hospital, and the
   University of Cambridge on the Illumina Infinium OncoArray
   platform.[165]^50 Before standard quality control, we removed the
   intentionally duplicated samples, samples from unrelated OncoArray
   studies, and HapMap control individuals of European, African, Chinese,
   and Japanese origins.

   The genotype data of TRICL were generated from the Affymetrix Axiom
   Array containing 414,504 markers, which was a custom panel of key LC
   GWAS markers, and rare coding SNVs and indels.

   We further excluded participants who lacked disease status, were
   second-degree relatives or closer (identity by descent [IBD] > 0.2),
   had low-quality DNA (call rate < 95%), showed sex inconsistency, or
   were non-European. SNPs were removed if they met any of the following
   criteria: (1) located on a sex chromosome, (2) minor allele frequency
   (MAF) < 0.05, (3) call rate < 95%, or (4) Hardy-Weinberg equilibrium
   (HWE) test P < 1.00 × 10^−7 in controls or P < 1.00 × 10^−12 in cases.
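
   The SNP-level criteria translate directly into a filter. The sketch
   below applies the four stated thresholds to one variant record. The
   record layout, the function name, and the set of sex-chromosome labels
   are assumptions made for illustration (in practice such filters are
   usually applied with a genotype toolkit such as PLINK), and the example
   variants are invented.

```python
from dataclasses import dataclass

# Assumed labels for sex chromosomes, including common numeric codes.
SEX_CHROMOSOMES = {"X", "Y", "XY", "23", "24", "25"}


@dataclass
class VariantQC:
    chrom: str
    maf: float             # minor allele frequency
    call_rate: float       # fraction of samples with a genotype call
    hwe_p_controls: float  # HWE test p value in controls
    hwe_p_cases: float     # HWE test p value in cases


def passes_snp_qc(v):
    """Return True if the variant survives all four exclusion criteria."""
    if v.chrom.upper() in SEX_CHROMOSOMES:
        return False
    if v.maf < 0.05:
        return False
    if v.call_rate < 0.95:
        return False
    if v.hwe_p_controls < 1e-7 or v.hwe_p_cases < 1e-12:
        return False
    return True


# Invented records.
good = VariantQC("5", 0.21, 0.99, 0.4, 0.3)
rare = VariantQC("5", 0.01, 0.99, 0.4, 0.3)
on_x = VariantQC("X", 0.21, 0.99, 0.4, 0.3)
hwe_fail = VariantQC("5", 0.21, 0.99, 1e-9, 0.3)
kept_variants = [v for v in (good, rare, on_x, hwe_fail) if passes_snp_qc(v)]
```

   The looser HWE threshold in cases than in controls means that a variant
   departing from equilibrium among cases is removed only when the
   departure is extreme.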

   The summary-level data of the FinnGen cohort were based on chip
   genotype data from Illumina and Affymetrix arrays, imputed using the
   population-specific SISu v3 imputation reference panel.

Data processing of the deCODE Genetics and Fenland cohort

   deCODE Genetics sequenced the whole genomes of 49,708 Icelanders to a
   median depth of 32× using Illumina technology.[166]^47 In total, deCODE
   Genetics has genotyped 166,281 Icelanders with Illumina SNP chips,
   long-range phased and imputed based on the sequenced dataset, of whom
   35,559 had proteins measured by SomaScan version 4. Aptamers for
   non-human proteins, aptamers listed as deprecated by SomaScan, and
   aptamers mapping to multiple genes were excluded, leaving 4,907
   aptamers targeting a total of 4,719 unique proteins.

   Fenland participants were genotyped using three genotyping arrays:
   Illumina Infinium Core Exome 24v1, Affymetrix SNP 5.0, and Affymetrix
   UK Biobank Axiom array for 12,435 participants of White British
   ancestry, which resulted in > 17 million genotyped or imputed variants
   with a minor allele frequency of > 0.1%. Relative protein abundances of
   4,775 human protein targets were evaluated by 4,979 aptamers (SomaLogic
   V4), which only included human protein targets and SOMAmers deemed
   acceptable by SomaLogic and that had a value between 0.25 and 4 for all
   scaling factors.[167]^14 These 4,775 proteins were used in the
   association study.

Imputation based on TOPMed imputation server

   To estimate missing genotype information, we performed an imputation on
   the TOPMed online imputation server, which phased haplotypes with Eagle
   v2.4[168]^51 using TOPMed data (Version r2) as a reference panel that
   included 97,256 reference samples and 308,107,085 genetic
   variants.[169]^52 The server performed imputations for PLCO,
   ILCCO-OncoArray, and TRICL using Minimac (version 4) software.

   All genetic variants were lifted to GRCh38 coordinates to maintain
   consistency with the UKB WES project. Poorly imputed SNVs with
   imputation quality score R^2 < 0.4 and SNVs on sex chromosomes were
   excluded from the analyses.

Association analyses of single-variant level

   We performed single-variant association tests for common variants (MAF
   ≥ 0.01) using the Scalable and Accurate Implementation of GEneralized
   mixed model (SAIGE, v1.1.6).[170]^39 SAIGE is a toolkit developed for
   genome-wide association tests in biobank-level datasets that uses
   saddlepoint approximation to handle extreme case-control imbalances of
   binary traits and linear mixed models to account for sample
   relatedness.

Proteome-wide MR (PW-MR) analyses

   We performed proteome-wide Mendelian randomization (PW-MR) based on
   cis- and trans-pQTLs from deCODE and Fenland and GWAS summary data from
   UKB, PLCO, ILCCO-OncoArray, TRICL, and FinnGen. For instrumental
   variable (IV) selection, we retained SNPs that were genome-wide
   significant (both cis-SNPs and trans-SNPs, P < 5.0 × 10^-8) for each
   pQTL in our study. To minimize correlated horizontal pleiotropy, we
   retained SNPs that were independent of each other (LD window of
   10,000 kb, R^2 < 0.1 in 1000G) in pQTLs. To quantify the statistical
   power of the pQTLs, the strength of each SNP was evaluated by the
   F-statistic, where an F-statistic ≥ 10 indicates sufficient instrument
   strength.[171]^53 IVs with an F-statistic < 10 were considered to have
   limited power (potentially causing weak instrument bias[172]^54) and
   were removed from the MR.
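
   As an illustration of the instrument-strength filter, the sketch below
   uses the common single-SNP approximation F ≈ (beta/SE)^2, computed from
   the pQTL effect estimate and its standard error. This approximation,
   the dictionary keys, and the function names are assumptions; the exact
   F-statistic formula used in this study is not stated here.

```python
def f_statistic(beta: float, se: float) -> float:
    """Approximate per-SNP F-statistic as (beta / se)^2 (an assumed formula)."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    return (beta / se) ** 2


def keep_strong_instruments(instruments, threshold=10.0):
    """Keep instruments whose approximate F-statistic is >= threshold.

    `instruments` is a list of dicts with hypothetical keys "beta" and "se"
    holding the SNP-protein effect estimate and its standard error.
    """
    return [iv for iv in instruments
            if f_statistic(iv["beta"], iv["se"]) >= threshold]
```

   For example, a SNP with beta = 0.5 and SE = 0.1 has F ≈ 25 and is
   retained, whereas a SNP with beta = 0.2 and SE = 0.1 has F ≈ 4 and is
   removed.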

Bayesian colocalization analysis using COLOC

   To further investigate whether the association signals for robust
   causal proteins and lung cancer were driven by a shared causal variant,
   we performed a Bayesian colocalization analysis of 270 causal proteins
   and lung cancer in deCODE and Fenland. Colocalization analysis relies
   on a single causal variant assumption, and the posterior probability
   (PP) for five hypotheses at each pleiotropic locus is provided: (i)
   H[0]: neither trait has a genetic association in the region; (ii) H[1]:
   only trait 1 has a genetic association in the region; (iii) H[2]: only
   trait 2 has a genetic association in the region; (iv) H[3]: both traits
   are associated, but with different causal variants; (v) H[4]: both
   traits are associated and share a single causal variant. The prior
   probabilities were set as p[1]=10^-4, p[2]=10^-4, p[12]=10^-5. For each
   robust causal protein, regions were defined as the area within 500 kb
   of the selected variants. A posterior probability of a shared causal
   variant (PP.H4) > 0.7 was considered strong support for
   colocalization. Medium support for colocalization was defined as
   0.5 < PP.H4 < 0.7.
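
   The way per-SNP evidence is combined into the five posterior
   probabilities can be sketched as follows. This is a simplified
   re-implementation of the standard approximate-Bayes-factor combination
   under the single causal variant assumption, written for illustration;
   it is not the code of the “coloc” R package, and the function names
   and inputs (per-SNP log approximate Bayes factors for each trait) are
   assumptions.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for finite values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_diff_exp(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference is not positive."""
    if b >= a:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities [H0, H1, H2, H3, H4] for one region.

    labf1, labf2: per-SNP log approximate Bayes factors for trait 1 and
    trait 2, aligned on the same SNPs.
    """
    if not labf1 or len(labf1) != len(labf2):
        raise ValueError("inputs must be non-empty and of equal length")
    sum1 = log_sum_exp(labf1)
    sum2 = log_sum_exp(labf2)
    sum12 = log_sum_exp([a + b for a, b in zip(labf1, labf2)])
    log_h = [
        0.0,                                    # H0: no association
        math.log(p1) + sum1,                    # H1: trait 1 only
        math.log(p2) + sum2,                    # H2: trait 2 only
        math.log(p1) + math.log(p2)
        + log_diff_exp(sum1 + sum2, sum12),     # H3: two distinct variants
        math.log(p12) + sum12,                  # H4: one shared variant
    ]
    denom = log_sum_exp(log_h)
    return [math.exp(h - denom) for h in log_h]
```

   When both traits have their strongest evidence at the same SNP, the
   H[4] term dominates; when the peaks fall on different SNPs, most of the
   posterior mass moves to H[3].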

Comparison analyses for gene expression or protein abundance in tumor and
adjacent normal tissues

   At the tissue level, we integrated transcriptomic and proteomic
   measurements to validate whether MR-identified robust and moderate
   causal proteins and protein-coding genes were observably associated
   with lung cancer. The paired gene expression and protein abundance
   data for tumor and adjacent normal tissues were collected from CPTAC.
   The protein abundances were further grouped by unique gene name by
   summing all protein abundances belonging to the same gene name.

Pathway enrichment analysis

   Pathway enrichment analysis was performed to identify biological
   pathways that were over-represented in the list of protein-coding
   genes of MR-identified robust and moderate causal proteins. We
   collected pathway information with gene sets from the KEGG database,
   containing a total of 213 pathways. All enrichment analyses were
   performed using the R package clusterProfiler.[173]^42
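
   Over-representation of a pathway in a gene list is commonly assessed
   with a one-sided hypergeometric test. The sketch below shows that
   calculation in generic form; it is an illustrative assumption about
   the type of test and does not reproduce clusterProfiler or the exact
   settings used in this study. All argument names are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background: int, n_pathway: int,
                       n_query: int, n_overlap: int) -> float:
    """One-sided hypergeometric P value for over-representation.

    n_background: number of genes in the background set
    n_pathway:    number of background genes annotated to the pathway
    n_query:      number of genes in the query list
    n_overlap:    number of query genes annotated to the pathway
    Returns P(X >= n_overlap) when n_query genes are drawn at random.
    """
    upper = min(n_pathway, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

   The resulting P values across pathways would then be corrected for
   multiple testing.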

Protein-protein interaction analysis

   To further understand the protein-protein interactions, we used the
   STRING database, which considers both physical interactions and
   functional associations.[174]^55 The protein interaction network was
   partitioned into clusters, shown in different colors, using Markov
   Clustering (MCL).

Development and validation of the risk score based on causal proteins

   We developed a causal-protein risk score (CPRS) for population risk
   stratification based on lung cancer causal proteins. To allow an
   independent validation phase, the PW-MR-meta causal effects were
   defined as the weights of the selected proteins. The CPRS included
   robust and moderate causal proteins (FDR-q < 0.05 in one data set,
   FDR-q < 0.05 or P < 0.05 in another) from the PW-MR-meta analysis
   whose observational effect (Cox proportional hazards model) and causal
   effect had a consistent direction.

   The CPRS was generated as:

   $$\mathrm{CPRS} = \sum_{i=1}^{n} \beta_i P_i,$$

   where $\beta_i$ denotes the coefficient of the $i$-th protein $P_i$
   calculated by PW-MR-meta,
   $(\beta_{\mathrm{deCODE}} + \beta_{\mathrm{Fenland}})/2$.
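
   The score definition above can be sketched in a few lines. The protein
   identifiers and numeric values in the usage note are made up for
   illustration, and the function names are hypothetical.

```python
def meta_weight(beta_decode: float, beta_fenland: float) -> float:
    """Weight of one protein: the mean of the deCODE and Fenland estimates."""
    return (beta_decode + beta_fenland) / 2.0


def cprs(protein_levels: dict, weights: dict) -> float:
    """Weighted sum of protein levels over all proteins in the panel.

    protein_levels: measured level per protein for one participant
    weights:        PW-MR-meta weight per protein in the panel
    """
    missing = set(weights) - set(protein_levels)
    if missing:
        raise KeyError(f"missing protein measurements: {sorted(missing)}")
    return sum(weights[name] * protein_levels[name] for name in weights)
```

   For a hypothetical two-protein panel with weights
   meta_weight(0.2, 0.4) = 0.3 and meta_weight(-0.1, -0.3) = -0.2, a
   participant with levels 1.0 and 2.0 would receive a score of
   0.3 × 1.0 − 0.2 × 2.0 = −0.1.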

   In the validation phase, the protein panel with previously determined
   weights was used to generate CPRS in the UKB Olink proteomics data.

   We used person-years to describe the absolute lung cancer incidence
   risk; follow-up time was defined as the interval from the date of
   cohort enrollment to lung cancer diagnosis or the last follow-up,
   whichever came first.

Polygenic risk score generation

   The polygenic risk score (PRS) was constructed as the sum of the number
   of minor alleles of the SNPs each participant carried, weighted by
   their effect sizes as log-odds ratios. We used PRS-128 (128 SNPs) to
   generate the PRS[175]^56^,[176]^57 based on UKB Olink proteomics data
   using PLINK 1.9 ([177]https://www.cog-genomics.org/plink/). These SNPs
   were collected from the known susceptibility loci of lung cancer and
   conditions related to lung cancer (such as lung function impairment)
   previously identified through literature curation and the NHGRI-EBI
   GWAS Catalog ([178]https://www.ebi.ac.uk/gwas/), and from additional
   loci that passed the suggestive significance level in GWAS. When
   variants were correlated, those representing independent loci with the
   strongest statistical significance were retained.

Quantification and statistical analysis

Statistics and software

   The genome-wide association testing of 4,719 plasma proteins was
   performed using a linear mixed model with 27.2 million imputed
   variants as genotypes after adjusting rank-inverse normal transformed
   levels for age, sex, and sample age for the deCODE Health study. The
   likelihood-ratio test was used to compute all P values.[179]^47

   PW-MR analyses were performed using the Wald ratio (proteins with only
   one available SNP) or the inverse variance weighted (IVW)
   method[180]^58^,[181]^59 (all other proteins) in the TwoSampleMR
   package[182]^40 to estimate the causal effect of blood proteins on lung
   cancer. The MR Steiger test of causal directionality was performed
   using the ‘directionality_test’ function in the TwoSampleMR
   package.[183]^40 The presence of pleiotropy was further investigated
   using MR-PRESSO and the MR-Egger method[184]^60 to estimate the
   potential effect of pleiotropy.[185]^61 Protein-disease signals with an
   MR-Egger intercept P value < 0.05 were considered to be influenced by
   horizontal pleiotropy. We also applied Cochran’s Q test to assess the
   potential heterogeneity of MR estimates (P value < 0.05). Proteins with
   horizontal pleiotropy and heterogeneity were excluded from all
   follow-up analyses. The results of PW-MR were summarized by
   meta-analysis (PW-MR-meta). A 5% FDR threshold was applied to correct
   for multiple testing. Bayesian colocalization was performed using the
   “coloc” R package.[186]^41
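
   The two MR estimators named above can be written out in simplified
   form. The sketch assumes a first-order standard error for the Wald
   ratio and a fixed-effect IVW estimate with Cochran’s Q computed from
   the per-SNP ratios; it is not the TwoSampleMR implementation, whose
   defaults may differ, and all function and argument names are
   hypothetical.

```python
import math


def wald_ratio(beta_exp: float, beta_out: float, se_out: float):
    """Wald ratio estimate and first-order standard error for one SNP."""
    estimate = beta_out / beta_exp
    se = se_out / abs(beta_exp)
    return estimate, se


def ivw(beta_exp, beta_out, se_out):
    """Fixed-effect IVW estimate, its standard error, and Cochran's Q.

    beta_exp: SNP-protein effects; beta_out: SNP-outcome effects;
    se_out: standard errors of the SNP-outcome effects.
    """
    if not (len(beta_exp) == len(beta_out) == len(se_out)) or len(beta_exp) < 2:
        raise ValueError("need at least two SNPs with aligned inputs")
    # Inverse-variance weight of each per-SNP ratio estimate.
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_exp, se_out)]
    ratios = [by / bx for bx, by in zip(beta_exp, beta_out)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    se = math.sqrt(1.0 / total)
    # Cochran's Q: weighted squared deviation of each ratio from the pooled estimate.
    q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    return estimate, se, q
```

   Under these assumptions, Q would be compared with a chi-squared
   distribution with (number of SNPs − 1) degrees of freedom to obtain
   the heterogeneity P value.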

   A paired t-test (FDR-q < 0.05) and fold change (FC > 1.5 or FC < 0.5)
   were used to identify differential gene expression or protein
   abundance.

   Cox proportional hazards models were used to evaluate the association
   of CPRS and PRS with lung cancer risk, adjusting for age, sex, body
   mass index (BMI), and smoking status. Hazard ratios (HRs) and 95%
   confidence intervals (CIs) were calculated. Participants were
   classified into ten equal parts according to the distributions of CPRS
   and PRS, respectively, and we compared the HR for each part with that
   of the lowest tenth. Individuals within the top 10%, 10%-90%, and the
   bottom 10% of CPRS and PRS were considered as populations at high,
   intermediate, and low genetic risk, respectively. We also used Cox
   regression to calculate and compare the cumulative incidence of lung
   cancer across the three subgroups of CPRS and PRS, respectively. The
   discrimination performance of the risk scores was evaluated by
   Harrell’s C-index and the time-dependent area under the receiver
   operating characteristic curve (AUC) using the R package timeROC. We
   used 10-fold internal cross-validation (repeated 500 times) to adjust
   the AUC for overfitting. Student’s t-test was performed to analyze the
   differences in Harrell’s C-index using the “cindex.comp” function of
   the R package survcomp.[187]^62 All data analyses were performed using
   R software (version 4.2.3, [188]https://www.r-project.org/).
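
   For readers unfamiliar with Harrell’s C-index, a minimal sketch of its
   definition follows. It counts usable pairs in which the participant
   with the shorter follow-up time had an event, and scores a pair as
   concordant when that participant also has the higher risk score. Ties
   in follow-up time are simply skipped here, which is a simplifying
   assumption; this is not the estimator implemented in the R packages
   cited above.

```python
def harrell_c_index(times, events, scores) -> float:
    """Harrell's C-index for right-censored data (simplified sketch).

    times:  follow-up time per participant
    events: 1 if lung cancer was diagnosed, 0 if censored
    scores: risk score per participant (higher means higher predicted risk)
    """
    if not (len(times) == len(events) == len(scores)):
        raise ValueError("inputs must have equal length")
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is usable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

   A value of 0.5 corresponds to no discrimination and 1.0 to perfect
   ranking of participants by time to diagnosis.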

Acknowledgments