Graphical abstract
[Figure: graphical abstract (fx1.jpg)]

Highlights
• We evaluated the causal relationship between proteins and lung cancer in Europeans
• 39 robust causal proteins were identified, representing potential drug targets
• We developed a CPRS, a promising approach for lung cancer risk stratification

Bioinformatics; Cancer; Omics

Introduction
Lung cancer is the leading cause of cancer death overall,^1 and most lung cancer patients are diagnosed at a late stage, when curative treatment is rarely possible.^2 Lung cancer is a multifactorial malignant disease driven by environmental exposure,^3 genetic factors,^4 and multi-omics biomarkers.^5 While environmental exposures and genetic polymorphisms have been widely recognized, it is crucial to explore downstream biomarkers, following the central dogma, to understand this complex disease.^6,7 A variety of biomarkers have been shown to aid in the early diagnosis of lung cancer,^8,9 but unmet clinical needs and technical challenges remain.
Proteins function as crucial hubs between genetics and phenotypes, and the complex protein-protein interactions that mediate signaling pathways and biological processes are central to lung cancer etiology.^10,11 Recently, proteomics-based risk models have shown promise in predicting incident lung cancer.^12,13 Naturally occurring genetic variation, either in close physical proximity to the protein-encoding gene (cis-) or anywhere else in the genome (trans-), has wide-ranging effects on protein structure and function, with important implications for complex diseases.^14 However, few studies have investigated causal proteins in lung cancer based on large-scale population protein quantitative trait loci (pQTLs).^15,16 Leveraging data from the human blood proteome in Mendelian randomization (MR) studies facilitates deeper characterization of circulating proteins causally associated with lung cancer.^17,18 Identifying such proteins could improve our understanding of the genetic architecture of lung cancer, reveal drug targets, and provide evidence for selecting high-risk individuals for low-dose computed tomography (LDCT) screening. Additionally, high-quality proteomic studies, such as the UK Biobank (UKB) Olink project, will provide additional scientific opportunities.^19

To systematically identify causal proteins, we analyzed large-scale genome-wide association studies (GWASs) of lung cancer from five databases:^20 UKB, the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial, the International Lung Cancer OncoArray Consortium (ILCCO-OncoArray), the Transdisciplinary Research in Cancer of the Lung (TRICL) research team, and FinnGen. Leveraging pQTL summary statistics from the deCODE Genetics and Fenland cohorts, we performed a meta-analysis of proteome-wide MR (PW-MR) studies and identified blood proteins causally linked to lung cancer.
Finally, we aimed to develop an effective tool, the causal-protein risk score (CPRS), to predict lung cancer risk and improve eligibility criteria for lung cancer screening programs.

Results

Proteome-wide Mendelian randomization studies of lung cancer
Using pQTLs from 35,559 Icelanders in deCODE Genetics and 10,708 Europeans in Fenland, we investigated the causal associations of 4,719 and 4,775 plasma proteins with lung cancer, respectively. Genetic association statistics for lung cancer were obtained from five cohorts (30,312 lung cancer cases and 652,902 controls from UKB, PLCO, ILCCO-OncoArray, TRICL, and FinnGen) (Table S1). We performed a meta-analysis of PW-MR (PW-MR-meta) based on the two pQTL datasets and the lung cancer genetic associations (Table S2). The workflow of the study design is illustrated in Figure 1.

The PW-MR-meta identified 270 unique causal plasma proteins significantly associated with lung cancer (FDR-q < 0.05), 7 of which overlapped with the 351 lung cancer protein-coding genes in the NHGRI-EBI GWAS Catalog (www.ebi.ac.uk/gwas/) (Figure 2 and Table S3). Among them, 182 proteins were identified in deCODE and 127 proteins were identified in Fenland (Figures 3A, 3B, and S1). In addition, we found 39 robust causal proteins (FDR-q < 0.05 in both datasets; Figures 3C and 3D), 78 moderate causal proteins (FDR-q < 0.05 in one dataset and p < 0.05 in the other), and 153 general causal proteins (FDR-q < 0.05 in either dataset without meeting the stricter criteria above). Bayesian colocalization showed that 3 causal proteins (PRSS27, TAPBPL, and PSG5) had a shared single causal signal (colocalization PP.H4 > 0.7) in deCODE. In Fenland, two causal proteins (SERPING1 and GAA) had a shared single causal signal (colocalization PP.H4 > 0.7), and two proteins (TLR3 and MICA) had moderate support for colocalization (0.5 < PP.H4 < 0.7) (Table S4).
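The three evidence tiers defined earlier in this section can be expressed as a simple decision rule. The following Python sketch is illustrative only: the function name `classify_tier` and its arguments are hypothetical, and it assumes the tiers are mutually exclusive, which is consistent with the reported counts (39 + 78 + 153 = 270) but is not stated explicitly in the text.

```python
from typing import Optional


def classify_tier(fdr_decode: float, p_decode: float,
                  fdr_fenland: float, p_fenland: float,
                  alpha: float = 0.05) -> Optional[str]:
    """Assign a protein to an evidence tier from its PW-MR results.

    Assumption: tiers are mutually exclusive, and a protein is tested
    in both pQTL datasets (missing results are not handled here).
    """
    sig_decode = fdr_decode < alpha
    sig_fenland = fdr_fenland < alpha

    # Robust: FDR-significant in both datasets.
    if sig_decode and sig_fenland:
        return "robust"
    # Moderate: FDR-significant in one dataset, nominal p < alpha in the other.
    if (sig_decode and p_fenland < alpha) or (sig_fenland and p_decode < alpha):
        return "moderate"
    # General: FDR-significant in one dataset only.
    if sig_decode or sig_fenland:
        return "general"
    # Not identified as causal.
    return None
```

For example, a protein with FDR-q = 0.01 in deCODE and only a nominal p = 0.03 (FDR-q = 0.20) in Fenland would fall in the moderate tier under this rule.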
We further analyzed the effects of the SNPs of the top 10 most significant robust causal proteins on lung cancer across the different GWAS datasets, which showed consistent effect directions (Figure S2). In the Mendelian randomization results from the Proteome PheWAS browser,^21 12 of the 39 proteins were available and examined in our study, and seven were significantly associated with lung cancer, lung adenocarcinoma, or lung squamous cell carcinoma (p < 0.05) (Figure S3 and Table S5).

Figure 1. Study workflow

Figure 2. PhenoGram of significant associations in the PW-MR meta studies
The blue dots represent the general causal proteins, the green dots the moderate causal proteins, and the red dots the robust causal proteins. Dots indicate PW-MR significance, and diamonds indicate both PW-MR and GWAS significance.

Figure 3. PW-MR meta studies of lung cancer
Manhattan plots of the lung cancer PW-MR meta study using (A) deCODE and (B) Fenland data (the 20 most significant proteins are labeled). The red dashed line represents FDR-q < 0.05. (C) Dot plot of the odds ratios (ORs) and 95% confidence intervals (CIs) of the robust causal proteins (FDR-q < 0.05 in both datasets). (D) Venn diagram depicting proteins associated with lung cancer in deCODE only, in Fenland only, or in both.

Multi-omics analysis for the identified proteins
We performed a multi-omics analysis to integrate the identified robust and moderate causal proteins with transcriptomic and proteomic data from CPTAC. A total of 110 unique protein-coding genes and 81 proteins that passed quality control were included. We found that the expression levels of 33 protein-coding genes differed between tumor and adjacent normal tissues (FDR-q < 0.05, FC > 1.5 or FC < 0.5).
The well-known interleukin family and immune-related genes CDH3, IL20RB, IL36A, CD27, CD109, and CCL19 were upregulated in tumor tissues, while IL3RA, IL18R1, and CDH5 were downregulated (Figure 4A and Table S6).

Figure 4. Main results of the multi-omics analysis for the identified proteins
(A) Volcano plot of the FC values and -log(P) values for the comparison of gene expression. (B) Volcano plot of the FC values and -log(P) values for the comparison of protein abundance. (C) KEGG pathway network from the enrichment analysis of the robust and moderate causal proteins. (D) Protein-protein interaction network of the robust and moderate causal proteins.

Differences in abundance between tumor and adjacent normal tissues were observed for 27 proteins (FDR-q < 0.05, FC > 1.5 or FC < 0.5). Interestingly, some proteins showed patterns similar to those of the corresponding gene expression, such as CCL19, CDH3, CDH5, and IL3RA. However, opposite trends were found for other proteins. For example, CD109 was downregulated in tumor tissues at the protein level (Figure 4B and Table S7).

Further, we performed KEGG pathway enrichment analysis for the robust and moderate causal proteins. Inflammation- and immune-related pathways were identified, such as cytokine-cytokine receptor interaction (p = 7.05 × 10^−8) and viral protein interaction with cytokine and cytokine receptor (p = 3.63 × 10^−5), as well as metabolic and classical cancer-related pathways, including glycosphingolipid biosynthesis - globo and isoglobo series (p = 3.95 × 10^−4) and the PI3K-Akt signaling pathway (p = 1.78 × 10^−2) (Figure 4C and Table S8).
Using the STRING database to integrate protein-protein interactions among the robust and moderate causal proteins, we identified two main clusters: the first cluster was related to signal transduction (e.g., GZMB, CXCL12, and FASLG) and the immune system (e.g., IL1B, CCL19, and CD27); the second cluster was related to metabolic pathways (e.g., C1GALT1C1 and ST3GAL1) (Figure 4D).

Causal proteins could identify high-risk population
For the development and validation of the CPRS and PRS, a total of 43,395 Europeans with Olink proteomics data in UKB were included in the prediction study. Among the 117 robust and moderate causal proteins (FDR-q < 0.05 in one dataset, FDR-q < 0.05 or p < 0.05 in the other) in the PW-MR-meta analysis, 41 significant proteins were available in the UKB individual-level data. Finally, 16 proteins with a consistent direction of observational and causal effects were selected to construct the CPRS to identify populations at high risk of incident lung cancer (Tables S9 and S10, and Figure S4). These proteins were mainly enriched in inflammatory, metabolic, and neurological categories (Figure S5). The CPRS significantly stratified the absolute risk of lung cancer incidence in the overall population (log-rank p < 2.20 × 10^−16) and in ever-smokers (log-rank p < 2.20 × 10^−16) in the UKB cohort. Further, all subjects were categorized into 10 groups by deciles of the CPRS and PRS, respectively. Compared with the low-risk group (the lowest tenth of the CPRS), subjects in the high-risk group (the top tenth of the CPRS) were at significantly higher risk of lung cancer, with a hazard ratio (HR) of 4.33 (95% CI: 2.65–7.06, p < 4.32 × 10^−9) overall (Figure S6A) and 5.51 (95% CI: 3.24–9.38, p < 3.04 × 10^−10) for ever-smokers (Figure S6B), which outperformed the PRS [top 10% versus bottom 10%: HR = 2.59 (95% CI: 1.66–4.05) overall; HR = 2.59 (95% CI: 1.62–4.17) for ever-smokers].
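The decile-based grouping described above can be illustrated with a short Python sketch. It is a hedged illustration rather than the authors' procedure: the function names are hypothetical, deciles are assigned by rank (ties are broken by input order), and the labels "low", "intermediate", and "high" follow the bottom tenth, middle, and top tenth convention used in the text and in Figure 5.

```python
from typing import List, Sequence


def decile_groups(scores: Sequence[float]) -> List[int]:
    """Return a decile index (1 = lowest tenth, 10 = top tenth) per score.

    Assumption: rank-based deciles; tied scores are split by input order.
    """
    n = len(scores)
    if n == 0:
        return []
    # Indices of the scores sorted from lowest to highest.
    order = sorted(range(n), key=lambda i: scores[i])
    groups = [0] * n
    for rank, idx in enumerate(order):
        groups[idx] = rank * 10 // n + 1
    return groups


def risk_labels(scores: Sequence[float]) -> List[str]:
    """Label the bottom tenth 'low', the top tenth 'high', the rest 'intermediate'."""
    labels = []
    for g in decile_groups(scores):
        if g == 1:
            labels.append("low")
        elif g == 10:
            labels.append("high")
        else:
            labels.append("intermediate")
    return labels
```

The hazard ratios reported above compare the "high" and "low" groups in a Cox regression; that model fit is not reproduced here.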
A cumulative effect of the CPRS and PRS was observed for incident lung cancer in UKB. Compared with the low-risk population (the lowest tenth of the CPRS), high-risk persons (the top tenth of the CPRS) had a hazard ratio (HR) of 4.21 (95% CI: 2.58–6.87) (Figure 5A). Similar results were observed among ever-smokers, for whom the HR of high-risk persons was 5.34 (95% CI: 3.14–9.08) (Figure 5B). Meanwhile, the HRs of the PRS were 2.59 (95% CI: 1.66–4.05) and 2.59 (95% CI: 1.61–4.17), respectively. Thus, compared with the PRS, the CPRS performed better in lung cancer risk stratification. For the CPRS, participants at low risk had a lower rate of lung cancer (35.92 per 100,000 person-years) than participants at high risk (232.24 per 100,000 person-years). For the PRS, participants at low risk had a lower rate of lung cancer (50.98 per 100,000 person-years) than participants at high risk (127.08 per 100,000 person-years) (Figure 5A). In addition, within the CPRS groups, a much higher cumulative lung cancer incidence was observed among ever-smokers at high risk than among low-risk participants (336.20 vs. 50.18 per 100,000 person-years). Similarly, the top tenth of the PRS group had a much higher cumulative lung cancer incidence (189.15 vs. 75.11 per 100,000 person-years) (Figure 5B).

Figure 5. Main results of discrimination evaluation for CPRS and PRS
(A) Cumulative lung cancer incidence plot for CPRS (solid line) and PRS (dotted line) in the overall UKB individual-level protein data. (B) Cumulative lung cancer incidence plot for CPRS (solid line) and PRS (dotted line) among smokers in the UKB individual-level protein data. The red line indicates high-risk persons, the yellow line indicates the intermediate-risk population, and the blue line indicates low-risk persons.
Hazard ratios and 95% confidence intervals derived from the Cox regression model adjusting for age, sex, BMI, and smoking status are provided in the legend. (C) The C-index values of CPRS and PRS generated by the Cox regression model.

Further, we evaluated the discrimination abilities of the CPRS and PRS for lung cancer incidence using the C-index and time-dependent AUC. The C-index of the CPRS was 0.656 (95% CI: 0.631–0.681), outperforming the traditional polygenic risk score (PRS) [0.560 (95% CI: 0.535–0.585)] (p = 5.38 × 10^−8) (Figure 5C). The time-dependent AUC of the CPRS was 65.93 (95% CI: 62.91–68.78, ten-year follow-up), outperforming the PRS [55.71 (95% CI: 52.67–58.59)] (Figure S7). The CPRS therefore had better discrimination power. After adjusting for non-genetic confounders (age, gender, BMI, and smoking status), the discrimination of the CPRS (C-index [95% CI]: 0.777 [0.757–0.797]) was still better than that of the PRS (C-index [95% CI]: 0.758 [0.738–0.778]) (p = 5.87 × 10^−7), and both outperformed the model containing only the non-genetic predictors (C-index [95% CI]: 0.751 [0.731–0.771]) (p = 4.93 × 10^−13, p = 3.95 × 10^−4). The CPRS model was generally better calibrated than the PRS model (Figure S8). These results suggest that the CPRS can predict the risk of lung cancer and could help refine the definition of high-risk sub-populations for individualized lung cancer prevention.

Discussion
In this study, we systematically evaluated the causal relationship between plasma proteins and lung cancer. We used GWAS summary statistics from UKB, PLCO, ILCCO-OncoArray, TRICL, and FinnGen, and performed PW-MR analyses based on two large-scale pQTL populations. Multi-omics analyses were performed to evaluate the functional evidence for the identified robust and moderate causal proteins. Moreover, we constructed a causal-protein risk score for lung cancer based on the PW-MR meta-analysis and validated it in UKB Olink proteomics data.
We identified plasma proteins that were causally related to lung cancer and may represent new therapeutic targets for the prevention or treatment of lung cancer. Leveraging significant pQTLs in deCODE and Fenland, 270 proteins were causally associated with lung cancer. Granzyme B (GZMB), the strongest causal protein, had a protective effect against lung cancer. GZMB is a serine protease most commonly found in cytotoxic lymphocytes and natural killer cells, which can induce Gasdermin E (GSDME)-dependent pyroptosis in tumor targets to activate anti-tumor immunity, both directly by cleaving GSDME and indirectly by activating caspase 3.^22 Ribonuclease T2 (RNASET2) is a human RNase T2 enzyme and the only extracellular nuclease of the RNase T2 family.^23 RNASET2 expression was reduced in primary ovarian tumors,^24 melanoma,^25 and non-Hodgkin's lymphoma.^26 However, increased expression of RNASET2 was associated with an increased risk of lung cancer,^27 which is consistent with our study. Interleukins can nurture an environment that enables and favors cancer growth while simultaneously being essential for a productive tumor-directed immune response.^28 In our study, ten interleukin proteins were found to be causally associated with lung cancer, three of which were robust causal proteins. Major histocompatibility complex (MHC) class I chain-related A/B (MICA/B) stress proteins are upregulated in response to DNA damage in many types of human cancers but are expressed at low or undetectable levels by healthy cells.^29,30,31,32 MICA/B can induce tumor immunity mediated by T cells and natural killer (NK) cells, making them a promising cancer vaccine target.^33

We observed strong functional evidence for the identified genes from the comparison of lung cancer tissues with adjacent normal tissues, the KEGG pathway network, and the protein-protein interaction network.
The signal transduction pathway is related to various body functions and is involved in important biological processes, including cell proliferation, differentiation, apoptosis, immune regulation, and hematopoiesis.^34 The immune system is intrinsic to health. By broadly assessing human immune system variation and considering interdependencies between immune system components, we could provide evidence for cancer prevention or treatment through modulation of the immune system.^35 Metabolic pathways are closely related to tumor initiation and progression and to the tumor microenvironment (TME), which can be depleted of certain nutrients, forcing cancer cells to adapt by inducing nutrient-scavenging mechanisms to sustain proliferation.^36

The causal proteins improve the identification of populations at high risk of lung cancer. It is widely recognized that early screening for lung cancer is most likely to be beneficial when the target tumor type has relatively uniform biology and a slower rate of progression.^37 Targeting high-risk populations with appropriate strategies for early detection could substantially reduce mortality.^38 However, selecting the population to be screened is a complex procedure, and it is difficult to accurately identify the high-risk persons who are most likely to benefit from screening. Plasma proteomes provide insight into contributing biological factors, and we investigated their potential value for future lung cancer prediction. By evaluating the C-index, time-dependent AUC, and risk stratification, we demonstrated that proteins had better predictive power than the PRS. It is possible to combine the CPRS with lung cancer screening strategies to improve screening efficiency. People at high risk should be screened frequently and regularly (e.g., once every three years), which is expected to further reduce cancer mortality.
Therefore, the CPRS is expected to serve as an informative benchmark to complement the PRS and baseline information that have been used in cancer risk assessment.

Our work has several strengths. Firstly, by harmonizing multiple large-scale GWASs, we comprehensively evaluated causal relationships between plasma proteins and lung cancer. We identified blood proteins causally linked with lung cancer through PW-MR-meta. Secondly, we explored the relationship between the identified proteins and lung cancer at multiple omics levels, including genomics, transcriptomics, and proteomics, which indicated that the identified signals were functional. Thirdly, we focused on high-risk population stratification based on proteins, whereas few studies have developed risk scores using causal proteins. We demonstrated the stable performance of the CPRS for lung cancer in the UKB Olink proteomics data, especially its ability to identify high-risk persons. Therefore, the CPRS might be a complementary risk assessment tool to be combined with existing screening guidelines.

In conclusion, this large-scale GWAS and PW-MR meta-analysis study of lung cancer identified plasma proteins causally associated with lung cancer, as well as pathways related to this disease, which may be further explored as possible therapeutic targets. Furthermore, this study provides novel insights into population risk stratification based on the CPRS, which can be used as a valuable supplement to existing lung cancer screening strategies.

Limitations of the study
It is essential to acknowledge the limitations of our study. Firstly, although the proteins and CPRS weights were determined using the PW-MR-meta information, the observational proteomics replication was conducted only in a subgroup of the UKB Olink proteomics data. External proteomics studies should be conducted to validate these findings. Secondly, we focused on European ancestry only.
It is essential to evaluate the associations of the proteins and the performance of the CPRS in non-European populations. Thirdly, we mainly investigated the effects of causal proteins on population risk stratification. However, the contribution of environmental factors should not be ignored. Well-established risk prediction models that incorporate environmental exposure factors, the PRS, and the CPRS should be developed for lung cancer.

STAR★Methods

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited data
deCODE Genetics | Zheng et al.^21 | https://www.decode.com
Fenland cohort | Pietzner et al.^18 | https://www.omicscience.org/apps/pgwas/
UK Biobank | | https://www.ukbiobank.ac.uk/
ILCCO-OncoArray | | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001273.v3.p2
TRICL | | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001681.v1.p1
FinnGen R6 | | https://www.finngen.fi/
PLCO | | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001286.v2.p2
CPTAC | | https://pdc.esacinc.com/pdc/pdc
Software and algorithms
Minimac (version 4) | | Minimac4 - Genome Analysis Wiki (umich.edu)
SAIGE (v1.1.6) | Zhou et al.^39 | https://github.com/weizhouUMICH/SAIGE/
TwoSampleMR R package | Hemani et al.^40 | https://www.mrbase.org
coloc R package | Giambartolomei et al.^41 | https://github.com/chr1swallace/coloc
clusterProfiler R package | Yu et al.^42 | https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html
STRING | | https://cn.string-db.org/
PLINK (v 1.9) | | https://www.cog-genomics.org/plink/
timeROC R package | GitHub | https://github.com/cran/timeROC
SurvComp R package | GitHub | https://github.com/bhklab/survcomp

Resource availability

Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Sipeng Shen (sshen@njmu.edu.cn).

Materials availability
This study did not generate new unique reagents.

Data and code availability
• This paper analyzes existing, publicly available data. The access URLs for the datasets are listed in the key resources table.
• This paper does not report original code.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Experimental model and study participant details
Our study is computational and does not use experimental models typical in the life sciences.

Method details

Study population and data collection

UKB
The UK Biobank (UKB) is a population-based prospective cohort of individuals aged 40–69 years, enrolled between 2006 and 2010.^43 The work described herein was approved by the UK Biobank under application 92675. All phenotype data were accessed in March 2022. Health-related outcomes were ascertained via individual record linkage to national cancer and mortality registries and hospital in-patient encounters. Cancer diagnoses were coded using International Classification of Diseases version 10 (ICD-10) codes. Individuals with at least one recorded incident diagnosis of a borderline, in situ, or primary malignant cancer were defined as cases, using data fields 41270 (Diagnoses - ICD10), 41202 (Diagnoses - main ICD10), 40006 (Type of cancer: ICD10), and 40001 (primary cause of death: ICD10). The data analyses were performed on the DNAnexus Research Analysis Platform (RAP).
To minimize the possibility of including lung cancer metastases, we excluded lung cancers that occurred within 5 years of a different primary cancer. In addition, prevalent lung cancer cases diagnosed prior to baseline enrollment were excluded. Finally, we analyzed 338,726 participants of European ancestry, comprising 4,083 primary lung cancer cases and 334,643 cancer-free controls.

PLCO
The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial is a large population-based randomized trial designed and sponsored by the National Cancer Institute (NCI) to determine the effects of screening on cancer-related mortality and secondary endpoints in over 150,000 men and women aged 55 to 74.^44 Participants have been followed for cancer incidence and mortality since the completion of the screening procedure in 2006. In addition, PLCO includes a large biorepository of biological samples, which has served as a unique resource for cancer research, particularly for etiologic and early-marker studies. Lung cancer diagnoses were coded using the International Classification of Diseases for Oncology, version 2 (ICD-O-2). Only primary invasive lung cancers diagnosed during the trial were included. Finally, we analyzed 98,651 participants of European ancestry, comprising 2,455 primary lung cancer cases and 96,196 controls.

ILCCO-OncoArray
The OncoArray Consortium is a network created to increase understanding of the genetic architecture of common cancers. The OncoArray GWAS was originally designed to profile genotype information for 57,775 participants, obtained from 29 studies across North America, Europe, and Asia.^27 All participants signed informed consent, and the studies were approved by the local institutional review boards or ethics committees and administered by trained personnel.
Tumors from patients were classified as adenocarcinomas, squamous carcinomas, large-cell carcinomas, mixed adenosquamous carcinomas, and other NSCLC histologies following either the International Classification of Diseases for Oncology (ICD-O) or World Health Organization coding.

FinnGen
FinnGen (FG) is a public-private partnership project combining electronic health record and registry data from six regional and three Finnish biobanks.^45,46 Participant data include genomics and health records linked to disease endpoints. FinnGen participants provided informed consent for biobank research. The FinnGen study is approved by the Finnish Institute for Health and Welfare. We used summary-level data from FG participants with completed genetic measurements and imputation. Association results for lung cancer versus cancer-free controls were downloaded (R6 data release).

deCODE genetics
The deCODE genetics database contains extensive genotype and phenotype information. Summary data on 4,719 proteins for 35,559 Icelanders were collected.^47

Fenland cohort
The Fenland study is a population-based cohort of 12,435 participants of Caucasian ancestry born between 1950 and 1975 who underwent detailed phenotyping at the baseline visit from 2005 to 2015. Summary data on 4,775 proteins for 10,708 Caucasian participants were collected.^18

UKB Olink proteomics
Olink proteomics data were generated in the UKB Pharma Proteomics Project.^19 We analyzed the initial batch of data, which was generated using the Olink Explore 1536 platform (1,463 proteins) on 43,395 participants of European descent. Linkages to national disease and death registries were used to identify incident lung cancers according to ICD-10 code C34. Participants diagnosed with lung cancer prior to recruitment were excluded. Additionally, participants without Olink proteomics data from recruitment were excluded from all prediction model construction.
Finally, 490 newly diagnosed lung cancer cases were included.

CPTAC
The National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis.^48 Fragments Per Kilobase per Million mapped reads (FPKM) values from lung tissue RNA sequencing data were logarithmically transformed. The Tandem Mass Tag (TMT)-labeled global proteome was analyzed using a ThermoFisher mass spectrometer (Thermo Scientific).^49 After filtering, 195 paired gene expression profiles and 200 paired protein abundance profiles from participants' tumor and adjacent normal tissues were analyzed.

Quality control for the SNP array data
The genotype data of UKB were imputed using computationally efficient methods combined with the Haplotype Reference Consortium (HRC) and UK10K haplotype resources. Details of the genotype data are described in data-fields 22418 and 22828. The genotype data of PLCO were generated from the Illumina GSA (673,132 markers), OncoArray (474,276 markers), and historical data including the Illumina OmniExpress (OmniX) (715,823 markers), Omni2.5M (Omni25) (2,310,570 markers), and Human Quad 610 (580,912 markers) SNP arrays. Samples duplicated across platforms were removed. Genotyping of 533,631 SNPs in ILCCO-OncoArray was completed at the Center for Inherited Disease Research, the Beijing Genome Institute, the Helmholtz Zentrum München, Copenhagen University Hospital, and the University of Cambridge on the Illumina Infinium OncoArray platform.^50 Before standard quality control, we removed intentionally duplicated samples, samples from unrelated OncoArray studies, and HapMap control individuals of European, African, Chinese, and Japanese origins.
The genotype data of TRICL were generated from the Affymetrix Axiom Array containing 414,504 markers, a custom panel of key LC GWAS markers, rare coding SNVs, and indels. We further excluded participants who lacked disease status, were second-degree relatives or closer (identity by descent [IBD] > 0.2), had low-quality DNA (call rate < 95%), showed sex inconsistency, or were non-European. SNPs were removed if they met any of the following criteria: (1) located on a sex chromosome, (2) minor allele frequency (MAF) < 0.05, (3) call rate < 95%, or (4) Hardy-Weinberg equilibrium (HWE) test P < 1.00 × 10^−7 in controls or P < 1.00 × 10^−12 in cases. The summary-level data of the FinnGen cohort were based on chip genotype data from Illumina and Affymetrix arrays, imputed using the population-specific SISu v3 imputation reference panel.

Data processing of the deCODE Genetics and Fenland cohort
deCODE Genetics sequenced the whole genomes of 49,708 Icelanders to a median depth of 32× using Illumina technology.^47 In total, deCODE Genetics has genotyped 166,281 Icelanders with Illumina SNP chips, long-range phased and imputed based on the sequenced dataset, of whom 35,559 had proteins measured by SomaScan version 4. Aptamers for non-human proteins, aptamers listed as deprecated by SomaScan, and aptamers mapping to multiple genes were excluded, leaving 4,907 aptamers targeting a total of 4,719 unique proteins.

Fenland participants were genotyped using three genotyping arrays (Illumina Infinium Core Exome 24v1, Affymetrix SNP 5.0, and Affymetrix UK Biobank Axiom array) for 12,435 participants of White British ancestry, which resulted in > 17 million genotyped or imputed variants with a minor allele frequency of > 0.1%.
Relative protein abundances of 4,775 human protein targets were evaluated by 4,979 aptamers (SomaLogic V4), which included only human protein targets and SOMAmers that were deemed acceptable by SomaLogic and had a value between 0.25 and 4 for all scaling factors.^14 The 4,775 proteins were used to perform the association study.

Imputation based on TOPMed imputation server
To estimate missing genotype information, we performed imputation on the TOPMed online imputation server, which phased haplotypes with Eagle v2.4^51 using TOPMed data (Version r2) as a reference panel that included 97,256 reference samples and 308,107,085 genetic variants.^52 The server performed imputation for PLCO, ILCCO-OncoArray, and TRICL using Minimac (version 4) software. All genetic variants were lifted to GRCh38 coordinates to maintain consistency with the UKB WES project. Poorly imputed SNVs with an imputation quality score R^2 < 0.4 and SNVs on sex chromosomes were excluded from the analyses.

Association analyses of single-variant level
We performed single-variant association tests for common variants (MAF ≥ 0.01) using the Scalable and Accurate Implementation of GEneralized mixed model (SAIGE, v1.1.6).^39 SAIGE is a toolkit developed for genome-wide association tests in biobank-level datasets that uses saddlepoint approximation to handle extreme case-control imbalances of binary traits and linear mixed models to account for sample relatedness.

Proteome-wide MR (PW-MR) analyses
We performed proteome-wide Mendelian randomization (PW-MR) based on cis- and trans-pQTLs from deCODE and Fenland and on GWAS summary data from UKB, PLCO, ILCCO-OncoArray, TRICL, and FinnGen. For instrumental variable (IV) selection, we retained SNPs that were genome-wide significant (both cis-SNPs and trans-SNPs, P < 5.0 × 10^−8) in each pQTL dataset. To minimize correlated horizontal pleiotropy, we retained SNPs independent of each other (LD window 10,000 kb, R^2 < 0.1 in 1000G) in the pQTLs.
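The instrument screening described in this subsection (a genome-wide significance filter followed by an F-statistic threshold of 10) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Snp` record and function names are hypothetical, the per-SNP F-statistic is approximated as (beta / se)^2, which is a common summary-statistic approximation that the text does not specify, and LD clumping is omitted because it requires a reference panel.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snp:
    """Hypothetical pQTL summary-statistic record for one variant."""
    rsid: str
    beta: float   # effect of the variant on the protein level
    se: float     # standard error of beta
    pval: float   # association p-value with the protein


def f_statistic(snp: Snp) -> float:
    """Approximate per-SNP F-statistic as (beta / se)^2 (an assumption)."""
    return (snp.beta / snp.se) ** 2


def select_instruments(snps: List[Snp],
                       p_threshold: float = 5.0e-8,
                       min_f: float = 10.0) -> List[Snp]:
    """Keep genome-wide significant SNPs whose F-statistic is at least min_f.

    LD clumping (10,000 kb window, R^2 < 0.1) is not performed here.
    """
    return [s for s in snps
            if s.pval < p_threshold and f_statistic(s) >= min_f]
```

In practice the clumping step would be run with a tool such as PLINK before or after this filter; the sketch only shows the two thresholds.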
To quantify the statistical power of the pQTLs, the strength of each SNP was evaluated by its F-statistic, where an F-statistic ≥ 10 indicates sufficient instrument strength.[171]^53 IVs with an F-statistic < 10 were considered to have limited power (potentially causing weak instrument bias[172]^54) and were removed from the MR analysis.

Bayesian colocalization analysis using COLOC

To further investigate whether the association signals of robust causal proteins and lung cancer were driven by a shared causal variant, we performed a Bayesian colocalization analysis of 270 causal proteins and lung cancer in deCODE and Fenland. Colocalization analysis relies on a single causal variant assumption and provides the posterior probability (PP) for five hypotheses at each pleiotropic locus: (i) H[0]: neither trait has a genetic association in the region; (ii) H[1]: only trait 1 has a genetic association in the region; (iii) H[2]: only trait 2 has a genetic association in the region; (iv) H[3]: both traits are associated, but with different causal variants; (v) H[4]: both traits are associated and share a single causal variant. The prior probabilities were set as p[1] = 10^-4, p[2] = 10^-4, and p[12] = 10^-5. For each robust causal protein, the region was defined as the area within 500 kb of the selected variants. A posterior probability for a shared causal variant (PP.H4) > 0.7 was considered strong support for colocalization, and 0.5 < PP.H4 < 0.7 was considered medium support.

Comparison analyses for gene expression or protein abundance in tumor and adjacent normal tissues

At the tissue level, we integrated transcriptomic and proteomic measurements to validate whether MR-identified robust and moderate causal proteins and their protein-coding genes were observably associated with lung cancer. Paired gene expression and protein abundance data from tumor and adjacent normal tissues were collected from CPTAC.
Protein abundances were then aggregated by unique gene name by summing all protein abundances mapped to the same gene name.

Pathway enrichment analysis

Pathway enrichment analysis was performed to identify specific biological pathways that were particularly enriched in the list of protein-coding genes of MR-identified robust and moderate causal proteins. We collected pathway information with gene sets from the KEGG database, containing a total of 213 pathways. All enrichment analyses were performed using the R package clusterProfiler.[173]^42

Protein-protein interaction analysis

To further understand the protein-protein interactions, we used the STRING database, which considers both physical interactions and functional associations.[174]^55 The protein interaction network was partitioned using Markov Clustering (MCL), with clusters shown in different colors.

Development and validation of the risk score based on causal proteins

We developed a causal-protein risk score (CPRS) for population risk stratification based on lung cancer causal proteins. To allow an independent validation phase, the PW-MR-meta causal effects were used as the weights of the selected proteins. The CPRS included robust and moderate causal proteins (FDR-q < 0.05 in one data set, FDR-q < 0.05 or P < 0.05 in another) from the PW-MR-meta analysis whose observational effect (Cox proportional hazards model) and causal effect had a consistent direction. The CPRS was generated as:

$$\mathrm{CPRS} = \sum_{i=1}^{n} \beta_i P_i$$

where $n$ is the number of selected proteins and $\beta_i$ denotes the coefficient of the $i$-th protein $P_i$ calculated by PW-MR-meta, $\beta_i = (\beta_{i,\mathrm{deCODE}} + \beta_{i,\mathrm{Fenland}})/2$. In the validation phase, the protein panel with these previously determined weights was used to generate the CPRS in the UKB Olink proteomics data.
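As an illustration of the CPRS definition above, the following Python sketch averages the two cohort-specific MR estimates into a per-protein weight and forms the weighted sum for one participant. The function names (`cprs_weights`, `cprs`), the protein labels, and all numeric values are hypothetical placeholders rather than results from the study, and the sketch assumes the protein levels are already on the scale used to derive the weights.

```python
def cprs_weights(beta_decode, beta_fenland):
    """Average the deCODE and Fenland MR estimates for each shared protein."""
    shared = beta_decode.keys() & beta_fenland.keys()
    return {p: (beta_decode[p] + beta_fenland[p]) / 2 for p in shared}


def cprs(protein_levels, weights):
    """Weighted sum of a participant's protein levels over the panel."""
    return sum(w * protein_levels[p] for p, w in weights.items())


# Hypothetical effect sizes and protein levels (not from the study).
beta_decode = {"PROT_A": 0.30, "PROT_B": -0.10}
beta_fenland = {"PROT_A": 0.20, "PROT_B": -0.30}
weights = cprs_weights(beta_decode, beta_fenland)

participant = {"PROT_A": 1.0, "PROT_B": 2.0}
score = cprs(participant, weights)
print(round(score, 3))
```

Because the weights are fixed before the score is computed in a new cohort, the same `weights` dictionary can be reused unchanged in a validation data set.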
We used person-years to describe the absolute lung cancer incidence risk, with follow-up time defined as the interval from the date of cohort enrollment to lung cancer diagnosis or the last follow-up, whichever came first.

Polygenic risk score generation

The polygenic risk score (PRS) was constructed as the sum of the number of minor alleles each participant carries across SNPs, weighted by each SNP's effect size (log odds ratio). We used PRS-128 (128 SNPs) to generate the PRS[175]^56^,[176]^57 based on UKB Olink proteomics data using PLINK 1.9 ([177]https://www.cog-genomics.org/plink/). These SNPs were collected from known susceptibility loci for lung cancer and for conditions related to lung cancer (such as lung function impairment) previously identified through literature curation and the NHGRI-EBI GWAS Catalog ([178]https://www.ebi.ac.uk/gwas/), together with additional loci that passed the suggestive significance level in GWAS studies. Where variants were correlated, the variant with the strongest statistical significance was retained to represent each independent locus.

Quantification and statistical analysis

Statistics and software

Genome-wide association testing of 4,719 plasma proteins was performed using a linear mixed model with 27.2 million imputed variants as genotypes, after adjusting rank-inverse normal transformed protein levels for age, sex, and sample age in the deCODE Health study. A likelihood-ratio test was used to compute all P values.[179]^47 PW-MR analyses were performed using the Wald ratio (for proteins with only one available SNP) or the inverse-variance weighted (IVW) method[180]^58^,[181]^59 (for all other proteins), as implemented in the TwoSampleMR package,[182]^40 to estimate the causal effect of blood proteins on lung cancer.
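The choice between the Wald ratio and the IVW estimator described above can be sketched in a few lines of Python. This is a textbook form written for illustration, not the TwoSampleMR code: the function names are hypothetical, the Wald ratio standard error uses a first-order delta-method approximation, and the IVW estimator shown is the fixed-effect version, so the package's own handling of heterogeneity may differ.

```python
import math


def wald_ratio(beta_exp, beta_out, se_out):
    """Single-instrument estimate; SE from a first-order delta approximation."""
    return beta_out / beta_exp, se_out / abs(beta_exp)


def ivw_fixed(beta_exp, beta_out, se_out):
    """Fixed-effect inverse-variance weighted estimate across instruments."""
    weights = [1.0 / s ** 2 for s in se_out]
    num = sum(w * bx * by for w, bx, by in zip(weights, beta_exp, beta_out))
    den = sum(w * bx ** 2 for w, bx in zip(weights, beta_exp))
    return num / den, math.sqrt(1.0 / den)


def mr_estimate(beta_exp, beta_out, se_out):
    """Use the Wald ratio for one instrument and IVW for several."""
    if len(beta_exp) == 1:
        return wald_ratio(beta_exp[0], beta_out[0], se_out[0])
    return ivw_fixed(beta_exp, beta_out, se_out)


# Hypothetical SNP-protein (exp) and SNP-lung cancer (out) effects.
single_est, single_se = mr_estimate([0.5], [0.10], [0.02])
multi_est, multi_se = mr_estimate([0.5, 0.4], [0.10, 0.08], [0.02, 0.02])
print(single_est, single_se)
print(multi_est, multi_se)
```

In the toy data both instruments imply the same ratio, so the IVW estimate coincides with the single-instrument Wald ratio while its standard error is smaller.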
The MR Steiger test of causal directionality was performed using the ‘directionality_test’ function in the TwoSampleMR package.[183]^40 The presence of pleiotropy was further investigated using MR-PRESSO and the MR-Egger method,[184]^60 which estimate the potential effect of pleiotropy.[185]^61 Protein-disease signals with an MR-Egger intercept P value < 0.05 were considered to be influenced by horizontal pleiotropy. We also applied Cochran’s Q test to assess potential heterogeneity of the MR estimates (P value < 0.05). Proteins with horizontal pleiotropy and heterogeneity were excluded from all follow-up analyses. The results of PW-MR were summarized by meta-analysis (PW-MR-meta). A 5% FDR threshold was applied to correct for multiple testing. Bayesian colocalization was performed using the “coloc” R package.[186]^41 A paired t-test (FDR-q < 0.05) and fold change (FC > 1.5 or FC < 0.5) were used to identify differential gene expression or protein abundance.

Cox proportional hazards models

Cox proportional hazards models were used to evaluate the associations of CPRS and PRS with lung cancer risk, adjusting for age, sex, body mass index (BMI), and smoking status. Hazard ratios (HRs) and 95% confidence intervals (CIs) were calculated. Participants were divided into ten equal-sized groups according to the distribution of CPRS and of PRS, and the HR for each group was compared with that of the lowest tenth. Individuals within the top 10%, 10%-90%, and the bottom 10% of CPRS and PRS were considered as populations at high, intermediate, and low genetic risk, respectively. We also used Cox regression to calculate and compare the cumulative incidence of lung cancer across the three CPRS and PRS subgroups. The discrimination performance of the risk scores was evaluated by Harrell’s C-index and the time-dependent area under the receiver operating characteristic curve (AUC) using the R package timeROC.
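To make the discrimination metric concrete, the following Python sketch computes Harrell's C-index directly from its pairwise definition. It is a simplified illustration rather than the implementation used in the analysis: the function name `harrell_c_index` and the toy data are hypothetical, pairs with tied follow-up times are ignored, and tied scores are counted as half-concordant.

```python
def harrell_c_index(times, events, scores):
    """Fraction of comparable pairs in which the subject with the earlier
    observed event also has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on a subject with an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy follow-up data (not from the study): time, event indicator, risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.1]
c_index = harrell_c_index(times, events, scores)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking of event times by the score; the quadratic loop is adequate for small examples but would be slow on biobank-scale data.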
We used 10-fold internal cross-validation (repeated 500 times) to adjust the AUC for overfitting. Student’s t-test was performed to analyze differences in Harrell’s C-index using the “cindex.comp” function of the R package survcomp.[187]^62 All data analyses were performed using R software (version 4.2.3, [188]https://www.r-project.org/).

Acknowledgments