Abstract

Polygenic risk score (PRS) and rare monogenic variant screening are valuable tools for predicting cancer risk and identifying individuals at high risk. Integrating both common and rare genetic variants is crucial for accurate risk assessment. However, estimating the impacts of rare variants on cancer and combining them with PRS remains challenging. Here, we analyze 454,711 exome sequencing and 487,409 array UK Biobank samples, focusing on breast and prostate cancers. We introduce an expanded PRS (EPRS) approach, yielding a systematic model for more effective risk stratification. By prioritizing and clustering genes with cancer-specific rare variants based on odds ratios and population-attributable fraction, we refine risk stratification by combining both monogenic and polygenic effects. Individuals in high-PRS groups with rare high-impact gene variants show up to 15- and 22-fold higher risk for breast and prostate cancers, respectively, compared to those in the intermediate-PRS groups without rare variants. Combined risk profiles vary across distinct rare variant clusters within the same PRS group for both cancers. Our EPRS approach enhances risk stratification for breast and prostate cancers, offering important insights for future research and potential applications to other cancer types.

Subject terms: Computational biology and bioinformatics, Risk factors

An expanded PRS (EPRS) approach combining polygenic risk scores and rare variant clustering enhances cancer risk stratification for breast and prostate cancers. High-PRS groups with rare high-impact gene variants have up to 15- and 22-fold higher risk for breast and prostate cancers, respectively, compared to intermediate-PRS groups without rare variants.

Introduction

Public health initiatives focus on reducing healthcare burdens by leveraging the social and financial benefits of disease prevention.
Early disease detection facilitates timely interventions, thereby preventing disease progression and the development of symptoms^1. Preventive measures involve identifying high-risk individuals for specific diseases by assessing their potential risk. Breast and prostate cancers are highly prevalent: breast cancer is the most common cancer among female patients, and prostate cancer is the second most common among male patients^2. The World Health Organization (WHO) advocates for organized, population-based mammography screenings every 2 years for women aged 50 to 69 years, highlighting the prospects for enhancing cost-effectiveness and benefit-to-harm ratios through risk-stratified screening^3. Women at increased risk of breast cancer have several options to reduce their risk, including surgery, medication, and lifestyle modifications. In addition, breast cancer risk estimation models now exist that attempt to quantify risk by accounting for individual factors, breast-related factors, genetic testing, and family history^4. Prostate cancer screening uses several modalities, including prostate-specific antigen (PSA) serum levels, digital rectal examination, modern imaging techniques (for example, multiparametric MRI), prostate biopsy, and liquid biopsy^5. Despite the high prevalence of prostate cancer, its etiology is not well understood. Established risk factors remain limited to advancing age, family history of the malignancy, and specific genetic variants^2. Importantly, both breast and prostate cancers exhibit high heritability, estimated at 31% and 58%, respectively^6,7. Consequently, incorporating genetic predisposition into risk stratification for these cancers is crucial. Moreover, there is substantial potential to improve preventive measures for both conditions^3,8.
Therefore, further research is needed to develop more effective prevention and risk stratification strategies for both breast and prostate cancers.

Genetic architecture studies play a crucial role in distinguishing individuals at potential risk for various diseases^9–12. In particular, predicting cancer risk, which has high heritability conferred by the genetic variants of an individual, is critically important^13–15. For many hereditary diseases, a small fraction of the population carries monogenic variants, leading to a significant increase in disease risk by interrupting genes within metabolic pathways^16. Additionally, polygenic scores, which combine the effects of numerous common genetic variants on disease risk, reflect an individual’s potential for disease prediction^17–19. Although these common variants contribute little to disease risk individually, their cumulative influence can be substantial, resulting in risk levels in some individuals comparable to those due to monogenic variants^9,13,20.

Current studies show that integrating monogenic variant effects into polygenic risk scores (PRS) for disease prediction improves the identification of high-risk groups^10,21–24. Previous studies have consolidated monogenic variants into a single carrier status, providing a valuable yet broad perspective on the genetic contributors to disease risk. Carriers were differentiated by their effect on the disease on the basis of previous studies of known causal genes or exons within specific disease pathways^22. Building upon this foundation, our study introduces a systematic strategy to categorize genes with rare pathogenic variants into clusters, offering a detailed view of monogenic influences on diseases. In addition, rather than solely comparing elevated risks derived from polygenic risk scores, our approach integrates these detailed monogenic clusters with PRS.
This integration aims to enhance the risk stratification for individuals, potentially improving the identification process for those at higher risk.

Therefore, the present study aimed to stratify breast and prostate cancer risk by analyzing 454,711 exome sequencing and 487,409 array samples from the UK Biobank. The genes selected for analysis were derived from studies on both cancer types to evaluate the ability of our method to prioritize causal monogenic risk variants specific to each cancer^5,25–27. We prioritized genes specific to each cancer type, clustering monogenic variant genes based on their respective odds ratios and population-attributable fractions (PAF)^28. Combining each gene cluster with the PRS risk groups partitioned risk into more refined groups. The results revealed that high-PRS samples with high-risk gene clusters had a prevalence of 0.59 for breast cancer and 0.46 for prostate cancer. Furthermore, the odds ratios showed a 15-fold increased risk for breast cancer and a 22-fold increased risk for prostate cancer compared to intermediate-PRS groups without rare variants for each respective cancer. We have termed this approach expanded PRS (EPRS) and discuss the application of this framework for estimating individual risks of breast and prostate cancer, highlighting its potential adaptability to other phenotypes for risk stratification based on monogenic and polygenic risk effects.

Results

Demographic information of UK Biobank cohort individuals for each cancer type and genetic risk factors

From the UK Biobank cohort, which includes 487,409 individuals with genotype data and 454,711 with exome data, we selected 121,918 female and 108,956 male samples for breast and prostate cancer analyses, respectively. The female samples had a mean age of 55.6 years, while the male samples had a mean age of 55.72 years. We identified 12,643 breast and 8753 prostate cancer cases.
For both cancer types, the remaining individuals with no other cancer diagnosis or self-reported cancer were classified as controls. Within the female and male control groups, 2856 (2.34%) and 2559 (2.35%) samples harbored pathogenic variants, respectively (Supplementary Table 1).

Initial validation of PRS in cancer risk stratification

We initially confirmed that an increase in PRS of each cancer enhanced its associated risk by comparing equal-sized tertiles of the PRS groups (low, intermediate, and high risk). Odds ratios for each group were calculated via logistic regression, using the intermediate-PRS group as a reference. For breast cancer, the odds ratio of the low-PRS group was 0.53, whereas the high-PRS group had an odds ratio of 1.97 (P = 2.70 × 10^−118 and 7.92 × 10^−207, respectively). The odds ratios for prostate cancer followed a similar trend, with values of 0.52 for the low-PRS group and 6.02 for the high-PRS group (P = 3.60 × 10^−45 and 0.00E+00, respectively) (Supplementary Fig. 1, Supplementary Table 2). These results support the hypothesis that PRS effectively stratifies disease risk. We also assessed the variance of the trait explained by PRS using R^2. The R^2 values on the liability scale were 0.099 and 0.344 for breast and prostate cancer, respectively (Supplementary Table 9).

Monogenic variant carriers show elevated risk across PRS groups

The intermediate-PRS group without any monogenic variants was selected as the reference group for computing odds ratios. We compared the odds ratios of cancer risk for each PRS group, both with and without monogenic variants, with the intermediate-PRS group devoid of monogenic variants. The odds ratios of the PRS groups without monogenic variants were consistent with those of PRS alone, where monogenic variants were not considered (Fig. 1, Supplementary Fig. 1).

Fig. 1.
Cancer odds ratio among individuals categorized according to the presence of monogenic variants and polygenic risk scores (PRS). Participants with or without monogenic variants were categorized into three strata based on their PRS (low, intermediate, or high). The odds ratio and 95% confidence intervals were calculated from a logistic regression model for (a) breast and (b) prostate cancer.

In all three stratified PRS groups, those with monogenic rare pathogenic variants, annotated by ClinVar guidelines, AlphaMissense, and PrimateAI 3D within both breast and prostate cancer candidate gene regions, exhibited an increased risk. Among carriers, odds ratios increased across the three PRS groups for breast cancer: 1.16, 2.03, and 4.69 for low, intermediate, and high, respectively. This trend was also observed in prostate cancer, with odds ratios of 0.62, 1.75, and 10.13 for low, intermediate, and high, respectively. For both cancer types, the highest odds ratio was observed in the high-PRS group with monogenic variants, and the lowest odds ratio was observed in the low-PRS group without monogenic variants. Specifically, for breast cancer, the odds ratio of the high-PRS group without monogenic variants was higher than that of the low-PRS group with monogenic variants (Supplementary Table 3, Fig. 1).

Genes prioritized and clustered based on odds ratios and PAF

In our analysis of the 23 candidate genes from both breast and prostate cancers, we prioritized known cancer-specific causal genes based on their odds ratios and PAFs (Supplementary Table 4). Density-based spatial clustering of applications with noise (DBSCAN) was used for clustering the monogenic effect of prioritized genes. For breast cancer, BRCA1, BRCA2, PALB2, CHEK2, and ATM were prioritized from the 23 genes. Among these, BRCA1 exhibited the highest odds ratio at 9.46 (P = 5.46 × 10^−20). The PAF for BRCA2 was the highest at 0.0053.
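As an illustration of how the two per-gene metrics relate, the sketch below computes an unadjusted odds ratio with a Wald 95% confidence interval and a population-attributable fraction from carrier counts. It is not the study's pipeline: the study estimated odds ratios by logistic regression and cites a separate reference for PAF, so the 2×2 estimator, the Miettinen form PAF = p_c (OR − 1)/OR (with p_c the proportion of cases who are carriers), the function name `or_and_paf`, and the counts are all assumptions chosen for illustration.

```python
import math


def or_and_paf(case_carriers, case_noncarriers, ctrl_carriers, ctrl_noncarriers):
    """Unadjusted OR with Wald 95% CI, and Miettinen-form PAF, from a 2x2 table."""
    a, b, c, d = case_carriers, case_noncarriers, ctrl_carriers, ctrl_noncarriers
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table (Woolf/Wald approximation).
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    # Assumed PAF form: proportion of cases exposed times (OR - 1) / OR.
    p_cases_exposed = a / (a + b)
    paf = p_cases_exposed * (odds_ratio - 1) / odds_ratio
    return odds_ratio, (lower, upper), paf


# Hypothetical counts, not taken from the study.
odds_ratio, ci, paf = or_and_paf(30, 9970, 40, 89960)
```

The example shows why the two metrics can diverge: a gene with few carriers can have a large odds ratio yet a small PAF, because PAF also scales with how common carriers are among cases.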
The pathogenic variant count was highest for CHEK2, with 457 occurrences (Table 1). We clustered the five breast cancer-specific genes based on their odds ratios and PAFs, resulting in two distinct groups (Supplementary Fig. 2a).

Table 1. Odds ratios and population-attributable fractions of genes for each cancer. Odds ratios and PAFs of genes were computed and used for clustering monogenic effects for each cancer.

Gene     Carrier (n)   OR (95% CI)         P value    PAF      Monogenic effect cluster
Breast cancer
BRCA1    69            9.46 (5.85–15.31)   5.46E−20   0.0024   2
BRCA2    216           5.43 (4.12–7.15)    2.11E−33   0.0053   2
PALB2    182           3.27 (2.37–4.52)    6.29E−13   0.0027   1
CHEK2    457           1.66 (1.3–2.1)      3.82E−05   0.0027   1
ATM      370           2.52 (1.98–3.2)     6.34E−14   0.0039   1
Prostate cancer
HOXB13   398           3.12 (2.47–3.94)    1.78E−21   0.0072   2
BRCA2    208           1.93 (1.33–2.78)    4.76E−04   0.0021   1
ATM      314           2.17 (1.62–2.89)    1.51E−07   0.0038   1

PAF population-attributable fraction, OR odds ratio

For prostate cancer, HOXB13, BRCA2, and ATM were prioritized (Table 1). Of these, HOXB13 showed the highest odds ratio at 3.12 (P = 1.78 × 10^−21), a PAF of 0.0072, and a pathogenic carrier count of 398. The three genes were differentiated into two groups (Supplementary Fig. 2b).

Cancer risk according to stratified risk groups

Separate analyses were conducted for each gene cluster (two clusters for each cancer). The odds ratios for cancer across groups were calculated using the intermediate-PRS group without monogenic variants as a reference. In every gene cluster and among samples without variants, odds ratios were highest in the high-PRS risk group, followed by the intermediate- and low-PRS risk groups (Table 2).

Table 2.
Odds ratio of risk groups stratified using monogenic effect clusters and polygenic risk scores for breast and prostate cancer

Monogenic effect cluster   PRS risk       OR (95% CI)           P-value     Sample (n)
Breast cancer
No variant                 Low            0.52 (0.49–0.55)      7.88E−115   39,664
No variant                 Intermediate   1 (0.95–1.05)         0.00E+00    39,682
No variant                 High           1.96 (1.87–2.05)      1.98E−196   39,716
1                          Low            1.49 (1.1–2.02)       1.09E−02    362
1                          Intermediate   2.12 (1.6–2.82)       1.84E−07    334
1                          High           5.71 (4.49–7.27)      1.02E−45    311
2                          Low            5.71 (3.71–8.78)      2.30E−15    95
2                          Intermediate   9 (6.01–13.49)        1.79E−26    101
2                          High           15.18 (9.79–23.52)    5.04E−34    89
Prostate cancer
No variant                 Low            0.52 (0.48–0.58)      2.65E−42    35,542
No variant                 Intermediate   1 (0.93–1.08)         0.00E+00    35,485
No variant                 High           6.02 (5.65–6.41)      0.00E+00    35,370
1                          Low            1.38 (0.69–2.79)      3.65E−01    165
1                          Intermediate   2.35 (1.25–4.42)      7.97E−03    142
1                          High           15.58 (11.36–21.36)   3.41E−65    215
2                          Low            0.44 (0.11–1.82)      2.60E−01    100
2                          Intermediate   3.55 (2.03–6.19)      8.34E−06    130
2                          High           22.41 (16.08–31.24)   2.77E−75    168

OR odds ratio, PRS polygenic risk score

In breast cancer, gene cluster 1 demonstrated odds ratios of 1.49, 2.12, and 5.71 for low-, intermediate-, and high-PRS risk groups (P = 1.09 × 10^−2, 1.84 × 10^−7, and 1.02 × 10^−45), respectively. The odds ratio of gene cluster 1 for the low-PRS risk group was lower than that of the no-variant high-PRS risk group (1.49 vs. 1.96) (Table 2). The odds ratios of the low-, intermediate-, and high-PRS risk groups of gene cluster 2 increased significantly to 5.71, 9.00, and 15.18 (P = 2.30 × 10^−15, 1.79 × 10^−26, and 5.04 × 10^−34), respectively, with 15.18 being the highest among all risk groups (Fig. 2a, Table 2).

Fig. 2. Cancer odds ratio among individuals categorized according to monogenic effect clusters and polygenic risk score (PRS). Participants were categorized into three strata based on their PRS (low, intermediate, or high) and monogenic effect cluster.
The odds ratios and 95% confidence intervals (CIs) were calculated from a logistic regression model for (a) breast and (b) prostate cancer.

For prostate cancer, the odds ratios for low-, intermediate-, and high-PRS risk groups in gene cluster 1 were 1.38, 2.35, and 15.58, respectively. The odds ratio for the low-PRS risk group in gene cluster 1 was lower than that of the no-variant high-PRS risk group (1.38 vs. 6.02). Additionally, the odds ratios for the low-, intermediate-, and high-PRS risk groups in gene cluster 2 were 0.44, 3.55, and 22.41, respectively; the intermediate- and high-PRS estimates were significant, whereas the low-PRS estimate was not (P = 0.26; Table 2). The low-PRS risk group in gene cluster 2 had the lowest odds ratio among all risk groups at 0.44, which was lower than that for the low-PRS risk group without variants. Notably, there was a significant increase in odds ratios within the high-PRS risk group across every monogenic effect cluster (Fig. 2b, Table 2).

Validation of the EPRS model

To substantiate the robustness of our EPRS model, we conducted a validation using 5-fold cross-validation within the UK Biobank dataset. We classified samples based on their PRS group and whether they carried pathogenic variants in specific gene clusters. Logistic regression models were then fitted to the training data (genetic data with cancer status). This process tested the EPRS across three distinct PRS groups and two gene clusters, employing the area under the curve (AUC) metric to gauge predictive accuracy. The resulting mean AUC values were 0.632 for breast cancer and 0.734 for prostate cancer, supporting the model’s ability to stratify cancer risk.

Prevalence of cancer risk stratified by EPRS

We estimated the stratified cancer risk by EPRS, considering both PRS risk groups and gene clusters, and assessed their influence on the prevalence of each cancer. The prevalence was calculated as the proportion of individuals within the cohort who had either a prior diagnosis of cancer or developed cancer during the study period.
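The prevalence definition above reduces to a grouped proportion. A minimal sketch, assuming each participant is represented by a PRS group, a monogenic effect cluster label, and a case flag; the record layout, the function name `prevalence_by_stratum`, and the toy records are hypothetical and are not study data. The sketch pools all carriers within a cluster and does not attempt any per-gene averaging.

```python
from collections import defaultdict


def prevalence_by_stratum(records):
    """Return {(prs_group, cluster): cases / total} for each observed stratum."""
    cases = defaultdict(int)
    totals = defaultdict(int)
    for prs_group, cluster, is_case in records:
        key = (prs_group, cluster)
        totals[key] += 1
        cases[key] += int(is_case)
    return {key: cases[key] / totals[key] for key in totals}


# Toy records (hypothetical): (PRS group, monogenic effect cluster, case status)
records = [
    ("high", "cluster 2", True),
    ("high", "cluster 2", True),
    ("high", "cluster 2", False),
    ("low", "no variant", False),
    ("low", "no variant", False),
    ("low", "no variant", False),
    ("low", "no variant", True),
]
prevalence = prevalence_by_stratum(records)
```

Because a case flag here covers both prior and incident diagnoses, the resulting proportion is a period prevalence within each stratum, not an incidence rate.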
This allowed us to capture the overall burden of disease, including both pre-existing and incident cases, to assess the ability of EPRS to stratify risk across a comprehensive population. The mean prevalence of each gene within a gene cluster per PRS risk group was calculated to demonstrate the combined effect of PRS risk and gene clusters for risk stratification. Among the breast cancer risk groups, the high-PRS group in gene cluster 2 had the highest prevalence at 0.59. The low-PRS risk group without any pathogenic variants had the lowest prevalence at 0.053 (Supplementary Table 5, Fig. 3a). A consistent increase in cancer prevalence was observed as both PRS risk and monogenic effect clusters increased.

Fig. 3. Prevalence of risk groups. The prevalence of stratified risk groups according to their monogenic effect cluster and polygenic risk score (PRS) risk group was calculated for UK Biobank samples for (a) breast and (b) prostate cancer.

Among the prostate cancer risk groups, the high-PRS group in gene cluster 2, which includes only the HOXB13 gene, had the highest prevalence at 0.46. The lowest prevalence was observed in the low-PRS risk group in gene cluster 2 at 0.020, followed by the low-PRS group without any pathogenic variants at 0.024 (Supplementary Table 6, Fig. 3b). The prevalence of the groups constructed by EPRS increased with increasing PRS in the order of low-, intermediate-, and high-risk groups. Additionally, a sequential increase was noted in gene clusters for breast cancer. In prostate cancer, the prevalence in the high-PRS group without any pathogenic variants surpassed those in the intermediate-PRS groups with both gene clusters 1 and 2.

Discussion

Disease risk can be stratified according to PRS in conjunction with monogenic variants in high-risk genes^11,13,21.
Using the EPRS approach, we systematically categorized monogenic variants by clustering risk genes using odds ratios and PAF values and then assessed the extent to which PRS influences each cluster in breast and prostate cancers. Through EPRS, we were able to observe the contributions of monogenic and polygenic effects on cancer risk, improving the understanding of the genetic profile influencing cancer risk.

PRS demonstrated significance in stratifying both breast and prostate cancer risk. The odds ratios of the low- and high-PRS groups for both cancers significantly differed from that of the intermediate-PRS group in both analyses. These findings demonstrate that the cumulative effect of single-nucleotide polymorphisms (SNPs) increases the risk of cancer, indicating that PRS alone can be utilized to stratify the risk of an individual. We used three different summary statistics for each cancer to construct PRS and applied PRSice2, LDpred2, and SBayesR. The performance of PRS was evaluated using R^2 on the liability scale, and the best-performing methods were selected: LDpred2 using summary statistics from Zhang et al.^29 for breast cancer, and PRSice2 using summary statistics from Wang et al.^30 for prostate cancer (Supplementary Table 9).

In addition, by incorporating the monogenic variant effect, we also observed increased cancer risk in each PRS group with pathogenic variants compared with that of PRS groups without any pathogenic variants. Pathogenic variants increase cancer incidence by interrupting metabolic pathways^13,16. In our study, cancer risk varied by PRS group and the presence of variants. Notably, the intermediate-PRS group with variants exhibited a significantly increased cancer risk compared with the high-PRS group without variants. Moreover, the high-PRS group with variants for both cancers displayed the highest risk among all groups.
Samples with monogenic variants in each PRS risk group demonstrated an up to 2-fold higher risk than those without monogenic variants for both cancers. Interestingly, the low-PRS group with monogenic variants slightly exceeded the risk of the intermediate-PRS group without variants for both cancers. Although the odds ratios were not significant, these findings suggest that monogenic effects can amplify the risk in samples with low polygenic effects.

We observed stratified risks in each group, depending on the absence or presence of monogenic variants. However, given that the impact of the presence or absence of monogenic variants can have a considerably more critical effect on risk than SNPs, applying the summation of genetic effects, often used for PRS construction, to monogenic variants may not fully represent the genetic risk of disease. Furthermore, the risk associated with monogenic variants can vary depending on the specific gene hosting that variant. Previous studies have primarily focused on the elevated cancer risks associated with well-known risk-increasing genes in conjunction with PRS^13,22,23. Notably, our EPRS approach provides a systematic method for prioritizing and clustering monogenic effects and integrating them with PRS, thereby refining cancer risk stratification.

In our EPRS approach, we prioritized genes specific to each type of cancer and clustered the monogenic effects based on their odds ratios and PAFs. By selecting genes with odds ratio Bonferroni-adjusted P-values less than 0.0022 and PAF values greater than 0, we were able to highlight the genes most significantly affecting each cancer type. This approach aligned with previous studies that identified genes such as ATM, BRCA1, BRCA2, CHEK2, and PALB2 as associated with an increased risk of breast cancer^26,31, and HOXB13 and BRCA2 with prostate cancer^5,27. We then clustered the identified genes to estimate their associated cancer risk.
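The prioritization rule can be sketched as a simple filter. The example assumes that the 0.0022 threshold corresponds to 0.05 divided by the 23 candidate genes (0.05/23 ≈ 0.00217), which the text does not state explicitly. The P values and PAFs are the breast cancer rows of Table 1, and the function name `prioritize` is hypothetical.

```python
ALPHA = 0.05
N_CANDIDATE_GENES = 23
# Assumed origin of the reported 0.0022 cutoff: a Bonferroni correction over 23 genes.
threshold = ALPHA / N_CANDIDATE_GENES

# (P value, PAF) per gene, copied from the breast cancer rows of Table 1.
breast_genes = {
    "BRCA1": (5.46e-20, 0.0024),
    "BRCA2": (2.11e-33, 0.0053),
    "PALB2": (6.29e-13, 0.0027),
    "CHEK2": (3.82e-05, 0.0027),
    "ATM": (6.34e-14, 0.0039),
}


def prioritize(genes, p_threshold):
    """Keep genes with P below the threshold and a positive PAF, sorted by name."""
    return sorted(
        gene for gene, (p_value, paf) in genes.items()
        if p_value < p_threshold and paf > 0
    )


prioritized = prioritize(breast_genes, threshold)
```

All five genes pass here because Table 1 lists only genes that were already prioritized; applying the same filter to the full candidate list would be needed to see which genes are excluded.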
In breast cancer, we identified five genes grouped into two distinct monogenic effect clusters. Monogenic effect clusters 1 and 2 showed moderate- and high-risk effects, respectively. In prostate cancer, we identified three genes clustered into two groups. Monogenic effect clusters 1 and 2 demonstrated moderate- and high-risk effects on prostate cancer. Additionally, we incorporated these cluster effects within PRS groups, facilitating a more detailed subdivision of risk stratification and its characteristics. When combined with PRS, all monogenic clusters increased the risk for both breast and prostate cancers. However, the degree of increase varied depending on the monogenic cluster effects, which were revealed through the odds ratios of each risk group. In breast cancer, monogenic effect clusters 1 and 2 demonstrated moderate- and high-risk effects, respectively, when combined with PRS. A total of 1292 samples were newly classified into different groups compared with those stratified by PRS risk alone. The odds ratios demonstrated a concurrent elevation in breast cancer risk influenced by both polygenic and monogenic effects. Samples in gene cluster 2 within the low-PRS risk category exhibited a higher odds ratio than even the high-PRS risk group without variants. High-PRS risk samples in clusters 1 and 2 displayed higher odds ratios than the high-PRS risk group with unclustered variants. This detailed stratification of monogenic effect clusters allowed us to observe more specific risk differences. Applying the EPRS approach to prostate cancer resulted in 817 samples being reclassified into different risk groups compared to those obtained using PRS alone. Similar to breast cancer, monogenic clusters 1 and 2 in prostate cancer demonstrated moderate and high risks, respectively, when combined with PRS. Despite this, the polygenic effects were more pronounced than the monogenic effects in our analysis. 
PRS showed robust predictive performance with an R² of 0.344 on the liability scale, presenting a steep increase in odds ratios through PRS risk levels. The highest and lowest odds ratios were observed in the high- and low-PRS risk groups of gene cluster 2, respectively. Therefore, our analysis enhanced the distinction of cancer risk groups beyond the scope of PRS alone by incorporating monogenic effect clusters.

Our findings highlight that the combined effect of PRS and monogenic clusters can substantially influence cancer risk. This was also evident in the observed prevalence of cancer among risk groups. For breast cancer, monogenic cluster 2 demonstrated a higher prevalence across all PRS groups; specifically, the high-PRS group had more than half of the samples in the risk group diagnosed with breast cancer. BRCA1 and BRCA2, which were identified as critical genetic variants causing breast cancer^26,32–35, were clustered in monogenic effect cluster 2. Specifically, in the high-PRS group, both genes exhibited high prevalence values: 0.67 for BRCA1 and 0.52 for BRCA2 (Supplementary Table 7). The prevalence of prostate cancer among risk groups also varied according to their PRS group and monogenic cluster effect, with a trend of increasing prevalence as the PRS risk group ascended. Monogenic effect cluster 2 exclusively contained HOXB13, a causative gene of prostate cancer^23,27. This gene demonstrated a higher prevalence than that of BRCA2 and ATM in cluster 1 for prostate cancer (Supplementary Table 8).

In breast cancer, pathway enrichment analysis revealed significant involvement in DNA damage response and repair mechanisms. Both clusters shared enrichment in critical pathways such as DNA double-strand break repair, cellular response to DNA damage, and cell cycle checkpoints, underscoring their collective role in maintaining genomic stability.
Key gene ontology biological processes associated with these pathways include double-strand break repair, DNA repair, and signal transduction in response to DNA damage. Despite their common roles, each cluster also exhibited unique pathway enrichments. BRCA1 and BRCA2 of cluster 2 were uniquely involved in pathways like the ATR-BRCA pathway and homology-directed repair, emphasizing their roles in precise DNA repair through homologous recombination and apoptotic signaling. Conversely, ATM, PALB2, and CHEK2 of cluster 1 were enriched in pathways related to diseases of DNA repair and response to ionizing radiation, highlighting their roles in signaling and repair processes under stress conditions.

For the pathway enrichment analysis of prostate cancer, two gene clusters emerged: one involving HOXB13 and another comprising ATM and BRCA2. Both clusters shared involvement in developmental and differentiation pathways, such as gland development, reproductive system development, and cellular growth. However, each cluster also exhibited unique pathway enrichments, reflecting their distinct functions. ATM and BRCA2 of cluster 1 are enriched in DNA repair and damage response pathways, including homologous recombination, the ATR-BRCA pathway, and the DNA repair complex, emphasizing their critical roles in genomic stability and preventing mutation propagation. In contrast, HOXB13 is uniquely enriched in pathways related to cellular growth and maturation, indicating its pivotal role in development and differentiation.

This study has some limitations. First, different partitioning criteria for PRS could have potentially shown better performance in cancer risk prediction compared to equal-sized tertiles. Various partitioning criteria, ranging from 5% to 35% in increments of 5% for both top and bottom percentages, were applied to our EPRS approach.
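One way to read these partitioning criteria is as symmetric percentile cutoffs on the PRS distribution. The sketch below is an assumption about the procedure, not the study's code: it labels the bottom q fraction of scores as low, the top q fraction as high, and the rest as intermediate, for q from 5% to 35%. The function name `partition_prs`, the rank-based cutoffs, and the simulated scores are hypothetical.

```python
import random


def partition_prs(scores, q):
    """Label the bottom q fraction 'low', the top q fraction 'high', else 'intermediate'."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices from lowest to highest PRS
    n_tail = int(round(q * n))
    labels = ["intermediate"] * n
    for i in order[:n_tail]:
        labels[i] = "low"
    for i in order[n - n_tail:]:
        labels[i] = "high"
    return labels


random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(1000)]  # simulated PRS values, not study data

group_sizes = {}
for pct in range(5, 40, 5):  # 5%, 10%, ..., 35%
    labels = partition_prs(scores, pct / 100)
    group_sizes[pct] = (
        labels.count("low"),
        labels.count("intermediate"),
        labels.count("high"),
    )
```

Under this reading, each criterion trades tail purity against group size: a 5% cutoff isolates more extreme scores but leaves few carriers per stratum, while 35% approaches the tertile split used in the main analysis.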
The mean AUC was calculated using 5-fold cross-validation, and the 30% partitioning criterion yielded the best performance for breast cancer, while 20% was optimal for prostate cancer (Supplementary Table 10). This approach effectively segregated high- and low-PRS risk groups. The highest and lowest prevalence of risk groups remained the same, with high-PRS and gene cluster 2 showing the highest risk, and low-PRS with no monogenic variant showing the lowest risk. The prevalence increased and decreased accordingly in these groups (Supplementary Fig. 3). However, the optimal partitioning criteria varied for different cancer types. Although we explored various segregation percentages, there may still be room for improvement as PRS is a continuous variable. Future applications of different segregation methods may enhance the understanding of polygenic effects. Nevertheless, the current study primarily focused on the potential of systematic risk stratification using genetic profiles.

The second limitation is a potential bias in selecting genes for systematic prioritization and monogenic effect clustering. The challenge of considering all functional genes is substantial. Given their shared causal genes, we focused on breast and prostate cancers to lessen the computational and financial burdens of analysis. Candidate genes specific to each cancer were selected based on previous studies^5,25–27. Although all pathogenic variants were accounted for regardless of the type of candidate cancer gene, each cancer-specific gene was prioritized, and cancer-specific monogenic clusters showed an elevated effect on cancer risk. However, further research into additional diseases is necessary for more precise systematic risk stratification.

The third limitation of this study arises from the recruitment bias of UK Biobank. This cohort predominantly consists of participants who are older, more educated, and of European ancestry.
Moreover, these participants generally exhibit healthier lifestyles and a lower prevalence of several health conditions compared to the general UK population. They are notably less likely to be obese, smoke, or consume alcohol daily. These characteristics suggest a ‘healthy volunteer’ bias, which may affect the generalizability of our findings, including the calculated population-attributable fractions (PAFs) and the observed cancer prevalence. To overcome these limitations, further studies should consider incorporating a broader population^36.

In summary, this study aimed to systematically stratify the risk of cancers by clustering genes with pathogenic variants based on odds ratios and PAF, which are used to infer risk levels elevated by rare variants. We addressed this using sequencing data from the UK Biobank. Our findings suggest that, for breast cancer, relying solely on the commonly used odds ratio may be insufficient to fully capture the risk contribution of certain genes. In our study, the well-known genes BRCA2 and PALB2 were similar in terms of odds ratios but differed in PAF, leading to their placement in different clusters. These clusters displayed distinguishable patterns in terms of breast cancer prevalence. Moreover, when combined with a polygenic risk score based on common variants, individuals with rare variants in BRCA1 and BRCA2 showed stratified patterns of cancer prevalence depending on PRS level. Similar findings were observed for prostate cancer. Therefore, we suggest that when considering risk stratification for cancer, it is beneficial to focus on both rare and common variant information, incorporating metrics such as PAF in addition to odds ratios for the estimation of rare variant gene effects.

However, we acknowledge certain limitations in our approach. Gene selection was based on a literature review, which may introduce bias and potentially miss significant genes.
Furthermore, the pathogenicity assessment of rare variants and gene clustering methods relied on subjective thresholds. The optimal PRS grouping thresholds varied between traits, suggesting that some may be trait-specific. While more quantitative and statistically rigorous methods, such as gene-based common variant scores combined with rare-variant burden tests, could provide a more robust framework for stratification, these methods require a much larger set of genes and significantly more data. To apply such methods effectively, future studies will require larger sample sizes and more comprehensive datasets, such as those obtained through whole genome sequencing (WGS). Incorporating WGS data would allow for the inclusion of a broader spectrum of variants, thereby providing a more detailed understanding of the roles of low-frequency, rare, and somatic variants in cancer risk. Such enhancements would significantly improve the precision and accuracy of systematic risk stratification.
Patients and methods
Data source
This study utilized genetic and phenotypic data from the UK Biobank (application ID 72128), a large-scale health cohort study designed to provide robust statistical power for various analyses. Data collection was conducted from 2006 to 2010, encompassing more than 500,000 participants aged between 40 and 69 years from multiple assessment centers, primarily in England, Scotland, and Wales. The UK Biobank performed high-quality genome-wide genotyping and genotype imputation, leveraging a reference panel from the Haplotype Reference Consortium^[123]37. Despite the overall healthier status of the cohort, lower prevalence of obesity, and reduced incidence of smoking or alcohol consumption, it is considered a representative sample of the white British population in the United Kingdom^[124]36. All ethical regulations relevant to human research participants were followed.
Study participants
This study identified cancer cases according to the International Classification of Diseases (ICD)-10 code in data field 41270, the ICD-9 code in data field 41271, or the self-reported cancer code in data field 20001. For breast cancer, female participants with ICD-10 code C50X, ICD-9 code 174X, or self-reported cancer code 1002 were categorized as cases. Similarly, male participants with ICD-10 code C61, ICD-9 code 185X, or self-reported cancer code 1044 were classified as prostate cancer cases. Samples without any other cancer diagnosis or self-reported cancer were assigned as controls.
Sample and genotype quality control
To analyze individuals with relatively homogeneous ancestry, and owing to the small percentages of non-British individuals, the present analysis was restricted to individuals of white British ancestry. Genetically confirmed ancestry was used to identify this subgroup, utilizing principal components in data field 22020. Exclusion criteria were putative sex chromosome aneuploidy (data field 22019), that is, samples identified as potentially carrying sex chromosome configurations other than the typical XX or XY; genetic kinship with other participants (data field 22021); and withdrawal of informed consent, identified centrally. Our study initially involved 454,711 exome sequencing samples and 487,409 array samples. Following quality control of the samples with imputed genotype data, 333,990 samples remained. The intersection of these datasets resulted in a total of 311,225 samples that were utilized for the analyses (Supplementary Fig. [125]4). In addition, variants with an imputation quality score below 0.7, a minor allele frequency under 0.01, or a missing genotype rate exceeding 0.05 were excluded. After these quality control measures were applied to imputed genotype data, a total of 8,700,879 variants remained for our analyses (Supplementary Fig. [126]5).
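As an illustration only, the variant-level exclusion rules above can be written as a single predicate. The sketch below is plain Python rather than the PLINK workflow used in the study; the `Variant` record, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass

# Thresholds as stated in the text; the constant names are illustrative.
MIN_INFO = 0.7       # imputation quality score
MIN_MAF = 0.01       # minor allele frequency
MAX_MISSING = 0.05   # missing genotype rate


@dataclass
class Variant:
    """Hypothetical per-variant summary record (field names are assumptions)."""
    variant_id: str
    info: float
    maf: float
    missing_rate: float


def passes_qc(v: Variant) -> bool:
    """Return True unless the variant meets any of the three exclusion rules."""
    return (
        v.info >= MIN_INFO
        and v.maf >= MIN_MAF
        and v.missing_rate <= MAX_MISSING
    )


# Made-up example variants, one per failure mode plus one that passes.
variants = [
    Variant("var_a", info=0.95, maf=0.20, missing_rate=0.01),   # kept
    Variant("var_b", info=0.65, maf=0.20, missing_rate=0.01),   # low imputation quality
    Variant("var_c", info=0.95, maf=0.005, missing_rate=0.01),  # too rare
    Variant("var_d", info=0.95, maf=0.20, missing_rate=0.08),   # too much missingness
]
kept = [v.variant_id for v in variants if passes_qc(v)]
```

A variant sitting exactly on a threshold is retained here, because the text excludes only values below 0.7, under 0.01, or exceeding 0.05.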
The quality control processes were executed using PLINK version 2.0.
Calculating polygenic risk scores from array data
The PRS were calculated using three distinct tools: PRSice2 (version 2.3.3)^[127]38, LDpred2^[128]39, and SBayesR^[129]40, each offering unique methodologies to enhance the accuracy and applicability of PRS. PRSice2 clumps single nucleotide polymorphisms (SNPs) based on linkage disequilibrium (LD) and P-value, followed by P-value thresholding, using default clumping options (--clump-kb 250 kb, --clump-p 1, and --clump-r2 0.1). LDpred2 utilizes a point-normal prior for SNP effect sizes and employs a Markov chain Monte Carlo (MCMC) procedure to infer posterior mean effect sizes. SBayesR performs Bayesian posterior inference to accommodate SNPs with small, medium, and large effects, thus allowing for more general effect size distributions^[130]41. For each cancer type, summary statistics were derived from genome-wide association studies^[131]19,[132]29,[133]30,[134]42–[135]44. Among these tools and summary statistics, we selected the most effective combination for each cancer based on R^2 performance on the liability scale (Supplementary Table [136]9). Specifically, we used LDpred2 with Zhang et al.^[137]29 for breast cancer and PRSice2 with Wang et al.^[138]30 for prostate cancer^[139]29,[140]30.
Selection of candidate genes for identifying rare variants
Based on previous research findings, we identified 23 candidate genes in which to investigate pathogenic rare variants for breast and prostate cancer. For breast cancer, we selected the susceptibility genes frequently identified in sequencing panels and genes harboring protein-truncating variants associated with overall breast cancer^[141]25,[142]26. For prostate cancer, we initially used the gene list known to heighten the risk of prostate cancer due to germline variants^[143]5. In addition, we included the overlapping genes from the list associated with hereditary prostate cancer^[144]27.
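For orientation, the polygenic scores described under "Calculating polygenic risk scores from array data" are, in their simplest form, weighted sums of effect-allele dosages. The following minimal sketch assumes that form together with plain P-value thresholding; it omits LD clumping and the Bayesian shrinkage performed by LDpred2 and SBayesR, and all names and numbers in it are hypothetical.

```python
from typing import Dict, List, NamedTuple


class SnpEffect(NamedTuple):
    """One row of GWAS summary statistics (hypothetical field names)."""
    snp_id: str
    beta: float      # per-allele effect size, e.g. a log odds ratio
    p_value: float


def polygenic_score(
    dosages: Dict[str, float],
    effects: List[SnpEffect],
    p_threshold: float = 1.0,
) -> float:
    """Sum beta * dosage over SNPs whose P-value passes the threshold.

    SNPs missing from `dosages` are skipped (an assumption of this sketch).
    """
    return sum(
        e.beta * dosages[e.snp_id]
        for e in effects
        if e.p_value <= p_threshold and e.snp_id in dosages
    )


# Made-up summary statistics and one individual's effect-allele dosages (0 to 2).
effects = [
    SnpEffect("snp1", 0.10, 1e-9),
    SnpEffect("snp2", -0.05, 1e-4),
    SnpEffect("snp3", 0.20, 0.30),
]
person = {"snp1": 2.0, "snp2": 1.0, "snp3": 1.0}

score_all = polygenic_score(person, effects)                       # all three SNPs
score_strict = polygenic_score(person, effects, p_threshold=1e-5)  # snp1 only
```

Lowering `p_threshold` restricts the sum to more strongly associated SNPs, which is the quantity a thresholding approach tunes.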
Identification of pathogenic variants
Pathogenic variants were identified using the dx command-line client of the DNAnexus Platform SDK, which allowed us to download whole-exome sequencing (WES) variant call format (VCF) files from the UKB Research Analysis Platform (RAP). We selected WES population VCF (pVCF) files based on the location of candidate genes, creating target regions of the genes using bedtools version 2.30.0 of the app-swiss-army-knife in the dx client. We downloaded each resulting region-based VCF file via RAP DNAnexus. Annotation of VCF files was achieved through an in-house pipeline, with individual variants annotated using SnpEff (5.0e) and dbNSFP (4.2c) software^[145]45,[146]46. The criteria for pathogenic variant selection were as follows: 1) variants deemed pathogenic or likely pathogenic by ClinVar; 2) variants with a ClinVar review status of two stars or higher; and 3) variants classified as pathogenic or likely pathogenic according to ClinVar, provided the variant was consistently categorized as such or had more than five pathogenic or likely pathogenic annotations without any benign/likely benign classification. We used the ClinVar database updated as of June 3, 2024.
Additionally, we incorporated information from AlphaMissense^[147]47 and PrimateAI^[148]48 to predict variant pathogenicity. Both algorithms utilize primate variant population frequency databases to predict the pathogenicity of missense variants. AlphaMissense adapts AlphaFold fine-tuned on human data, while PrimateAI employs deep neural networks. Both algorithms provide pathogenicity scores ranging from 0 to 1 and classify variants based on specific thresholds. In AlphaMissense, variants with a pathogenicity value over 0.9 are considered pathogenic^[149]49 while PrimateAI classifies pathogenic variants as ‘deleterious’. Only variants meeting both criteria were classified as pathogenic and included in our analysis.
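The in silico part of this filter can be sketched as follows, assuming that "both criteria" refers to the AlphaMissense and PrimateAI calls. How these predictions were combined with the ClinVar criteria is not modelled here. `MissenseCall` and its fields are hypothetical, the label string is taken from the wording of the text and may be encoded differently in real annotation files, and treating a missing prediction as "not pathogenic" is an assumption of this sketch.

```python
from typing import NamedTuple, Optional

ALPHAMISSENSE_CUTOFF = 0.9          # "over 0.9" in the text
PRIMATEAI_PATHOGENIC = "deleterious"  # label wording from the text (assumed encoding)


class MissenseCall(NamedTuple):
    """Hypothetical annotation record for one missense variant."""
    variant_id: str
    alphamissense_score: Optional[float]  # 0 to 1, None if not annotated
    primateai_label: Optional[str]        # None if not annotated


def predicted_pathogenic(call: MissenseCall) -> bool:
    """True only when both predictors call the variant pathogenic."""
    if call.alphamissense_score is None or call.primateai_label is None:
        return False  # assumption: no prediction means no pathogenic call
    return (
        call.alphamissense_score > ALPHAMISSENSE_CUTOFF
        and call.primateai_label.lower() == PRIMATEAI_PATHOGENIC
    )


# Made-up example calls.
calls = [
    MissenseCall("v1", 0.95, "deleterious"),  # both agree
    MissenseCall("v2", 0.95, "tolerated"),    # PrimateAI disagrees
    MissenseCall("v3", 0.90, "deleterious"),  # score not strictly over 0.9
    MissenseCall("v4", None, "deleterious"),  # missing AlphaMissense score
]
selected = [c.variant_id for c in calls if predicted_pathogenic(c)]
```

Requiring agreement between the two predictors trades sensitivity for specificity, which is consistent with the conservative selection described above.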
Prioritization and clustering of monogenic effects
To prioritize the genes for specific cancer types, we considered all 23 candidate genes for each cancer. We identified samples carrying pathogenic variants within these genes, treating the presence of a pathogenic variant as an interruption of the gene’s function. Thus, any individual harboring a pathogenic variant was considered a carrier. Using the cancer phenotype, we calculated the odds ratio and PAF for each gene. Odds ratios were computed using a logistic regression model, controlling for age at recruitment and the first four principal components (PCs). The PAF was computed from the estimated relative risk associated with the presence of a monogenic variant and the prevalence of the variant among cases^[150]28, giving the fraction of each cancer attributable to rare variants in that gene. Genes whose odds ratios had Bonferroni-adjusted P-values less than 0.0022 and whose PAF values were greater than 0 were selected. This approach ensured the identification of genes exerting a significant influence on each cancer type. Subsequently, we grouped these prioritized genes to estimate their collective cancer risk, employing density-based spatial clustering of applications with noise (DBSCAN). The clustering used both the log10 of the odds ratio and the PAF for each gene to adjust the scale of both values. In our approach, this clustering synthesizes multiple rare variants within a gene into a single monogenic effect, facilitating a comprehensive analysis of their combined impact on cancer risk.
Validation of the EPRS model
We conducted 5-fold cross-validation within the UK Biobank dataset. This process involved testing the EPRS across various partitioning criteria to identify the optimal PRS group thresholds for each cancer type (Supplementary Table [151]10). First, we partitioned the PRS into three distinct groups based on specific quantile thresholds.
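A minimal sketch of such a partition is given below, assuming a symmetric rank-based split in which the same fraction of samples is assigned to the low and high tails and the remainder forms the intermediate group. The function name `prs_groups`, the tie handling, and the example scores are assumptions for illustration and are not taken from the study code.

```python
from typing import List, Sequence


def prs_groups(scores: Sequence[float], tail_fraction: float) -> List[str]:
    """Label each score 'low', 'intermediate', or 'high'.

    The bottom and top `tail_fraction` of samples (by rank) form the low and
    high groups. Ties are broken by input order, an arbitrary choice here.
    """
    if not 0.0 < tail_fraction < 0.5:
        raise ValueError("tail_fraction must lie strictly between 0 and 0.5")
    n = len(scores)
    k = int(n * tail_fraction)                    # samples per tail
    order = sorted(range(n), key=scores.__getitem__)
    labels = ["intermediate"] * n
    for i in order[:k]:
        labels[i] = "low"
    for i in order[n - k:]:
        labels[i] = "high"
    return labels


# Made-up standardized PRS values for ten individuals.
scores = [0.3, -1.2, 2.1, 0.0, 1.5, -0.4, 0.9, -2.0, 0.6, 1.1]
groups = prs_groups(scores, tail_fraction=0.20)
```

With `tail_fraction=0.20`, two of the ten example individuals fall in each tail and six remain in the intermediate group, which then serves as the natural reference category.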
The top and bottom quantiles tested were 5%, 10%, 15%, 20%, 25%, 30%, 33.3%, and 35%. This approach allowed us to determine the most effective thresholds for stratifying individuals into low, intermediate, and high-PRS groups. Next, we merged these PRS groups with phenotypic and genetic cluster data. Samples were classified based on their PRS group and whether they carried pathogenic variants in specific gene clusters. Logistic regression models were then fitted to these data, adjusting for the effects of age and the first four principal components. We performed 5-fold cross-validation to evaluate the model’s performance. The dataset was split into five subsets, where the model was trained on four subsets and tested on the remaining one, rotating through all subsets. We calculated the area under the curve (AUC) metric to gauge predictive performance.
Statistics and reproducibility
We stratified individuals based on PRS tertiles and the presence or absence of pathogenic variants and divided them into equal-sized PRS groups (low, intermediate, and high). The intermediate-PRS group was used as a reference for computing the odds ratios to assess cancer prevalence in the population. Initial calculations of odds ratios were performed for the three PRS groups, demonstrating risk stratification using PRS only. Our analysis considered all pathogenic variants in candidate genes, using the intermediate-PRS group without any pathogenic variants as a reference. For the monogenic cluster, we computed the odds ratio for each gene harboring pathogenic variants and the PAF for each gene, indicating the fraction of cancer attributable to the interrupted gene. Using these two measures, we classified the effect of monogenic variants through DBSCAN. The reference group was also used to compute the odds ratio for the specific effect of the monogenic cluster on cancer.
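To make the clustering inputs concrete, the sketch below computes a PAF from the carrier fraction among cases and the relative risk, assuming the case-based form PAF = p_c (RR - 1) / RR with the odds ratio standing in for the relative risk, and then groups genes by (log10 odds ratio, PAF in percent) with a small hand-written DBSCAN. The gene labels, numbers, `eps`, and `min_samples` are hypothetical and chosen only for illustration; in practice a library implementation such as scikit-learn's `DBSCAN` would normally be used.

```python
import math

NOISE = -1


def paf_from_cases(case_carrier_fraction, relative_risk):
    """Case-based PAF, p_c * (RR - 1) / RR (assumed form; OR used as RR below)."""
    return case_carrier_fraction * (relative_risk - 1.0) / relative_risk


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN for 2-D points; returns one label per point, -1 = noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # Indices within eps of point i (the point itself is included).
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE           # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster     # border point reached from a core point
                continue
            if labels[j] is not None:
                continue                # already assigned to a cluster
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_samples:
                queue.extend(reach)     # j is a core point: keep expanding
    return labels


# Hypothetical genes: (label, odds ratio, carrier fraction among cases).
genes = [
    ("GENE_A", 5.0, 0.020),
    ("GENE_B", 5.5, 0.018),
    ("GENE_C", 5.2, 0.004),
    ("GENE_D", 4.8, 0.003),
    ("GENE_E", 2.0, 0.005),
]
# Features: log10(OR) and PAF expressed in percent (the unit is an assumption).
features = [(math.log10(odds), 100.0 * paf_from_cases(frac, odds))
            for _, odds, frac in genes]
labels = dbscan(features, eps=0.3, min_samples=2)
```

In this toy input, GENE_A and GENE_C have similar odds ratios but fall into different clusters because their PAFs differ, mirroring the kind of separation described for genes with comparable odds ratios; GENE_E has no neighbor within `eps` and is labelled as noise.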
For each group, we used a logistic regression model adjusted for age at recruitment and the first four PCs to compute the odds ratio. To substantiate the robustness of our EPRS model, we conducted a validation using 5-fold cross-validation within the UK Biobank dataset employing the mean AUC metric to gauge predictive performance. All statistical analyses were conducted using Python version 3.7.9, with modules statsmodels 0.13.2 and scikit-learn 1.1.1. All plots were created using matplotlib 3.5.2 and seaborn 0.11.2.
Reporting summary
Further information on research design is available in the [152]Nature Portfolio Reporting Summary linked to this article.
Acknowledgements