Abstract

   Polygenic risk score (PRS) and rare monogenic variant screening are
   valuable tools for predicting cancer risk and identifying individuals
   at high risk. Integrating both common and rare genetic variants is
   crucial for accurate risk assessment. However, estimating the impacts
   of rare variants on cancer and combining them with PRS remains
   challenging. Here, we analyze 454,711 exome sequencing and 487,409
   array UK Biobank samples, focusing on breast and prostate cancers. We
   introduce an expanded PRS (EPRS) approach, yielding a systematic model
   for more effective risk stratification. By prioritizing and clustering
   genes with cancer-specific rare variants based on odds ratios and
   population-attributable fraction, we refine risk stratification by
   combining both monogenic and polygenic effects. Individuals in high-PRS
   groups with rare high-impact gene variants show up to 15- and 22-fold
   higher risk for breast and prostate cancers, respectively, compared to
   those in the intermediate-PRS groups without rare variants. Combined
   risk profiles vary across distinct rare variant clusters within the
   same PRS group for both cancers. Our EPRS approach enhances risk
   stratification for breast and prostate cancers, offering important
   insights for future research and potential applications to other cancer
   types.

   Subject terms: Computational biology and bioinformatics, Risk factors
     __________________________________________________________________

   An expanded PRS (EPRS) approach combining polygenic risk scores and
   rare variant clustering enhances cancer risk stratification for breast
   and prostate cancers. High-PRS groups with rare high-impact gene
   variants have up to 15- and 22-fold higher risk for breast and prostate
   cancers, respectively, compared to intermediate-PRS groups without rare
   variants.

Introduction

   Public health initiatives focus on reducing healthcare burdens by
   leveraging the social and financial benefits of disease prevention.
   Early disease detection facilitates timely interventions, thereby
   preventing disease progression and the development of symptoms^[34]1.
   Preventive measures involve identifying high-risk individuals for
   specific diseases by assessing their potential risk.

   Breast and prostate cancers are highly prevalent, with breast cancer
   being the most common among female patients, while prostate cancer
   ranks as the second most common among male patients^[35]2. The World
   Health Organization (WHO) advocates for organized, population-based
   mammography screenings every 2 years for women aged 50 to 69 years,
   highlighting the prospects for enhancing cost-effectiveness and
   benefit-to-harm ratios through risk-stratified screening^[36]3. Women
   at increased risk of breast cancer have several options to reduce their
   risk, including surgery, medication and lifestyle modifications. In
   addition, breast cancer risk estimation models now attempt to quantify
   risk by accounting for individual factors, breast-related factors,
   genetic testing and family history^[37]4. Prostate cancer screening
   relies on several modalities, including prostate-specific
   antigen (PSA) serum levels, digital rectal examination, modern imaging
   techniques (for example, multiparametric MRI), prostate biopsy and
   liquid biopsy^[38]5. Despite the high prevalence of prostate cancer,
   its etiology is not well understood. Established risk factors remain
   limited to advancing age, family history of the malignancy, and
   specific genetic variants^[39]2. Importantly, both breast and prostate
   cancers exhibit high heritability estimated at 31% and 58%,
   respectively^[40]6,[41]7. Consequently, incorporating genetic
   predisposition into risk stratification for these cancers is crucial.
   Moreover, there is substantial potential to improve preventive measures
   for both conditions^[42]3,[43]8. Therefore, further research is needed
   to develop more effective prevention and risk stratification strategies
   for both breast and prostate cancers.

   Genetic architecture studies play a crucial role in distinguishing
   individuals at potential risk for various diseases^[44]9–[45]12.
   Particularly, predicting cancer risk, which has high heritability
   conferred by the genetic variants of an individual, is critically
   important^[46]13–[47]15. For many hereditary diseases, a small fraction
   of the population carries monogenic variants, leading to a significant
   increase in disease risk by interrupting genes within metabolic
   pathways^[48]16. Additionally, polygenic scores, which combine the
   effects of numerous common genetic variants on disease risk, reflect an
   individual’s potential for disease prediction^[49]17–[50]19. Although
   these common variants have low contributions to disease risk
   individually, their cumulative influence can be substantial, resulting
   in risk levels in some individuals comparable to those due to monogenic
   variants^[51]9,[52]13,[53]20.

   Current studies on integrating monogenic variant effects into polygenic
   risk scores (PRS) for disease prediction show that their integration
   improves the identification of high-risk groups^[54]10,[55]21–[56]24.
   Previous studies have consolidated monogenic variants into a single
   carrier status, providing a valuable yet broad perspective on the
   genetic contributors to disease risk. Carriers were differentiated by
   their effect on the disease, based on previous studies of known causal
   genes or exons within specific disease pathways^[57]22.
   Building upon this foundation, our study introduces a systematic
   strategy to categorize genes with rare pathogenic variants into
   clusters, offering a detailed view of monogenic influences on diseases.
   In addition, rather than solely comparing elevated risks derived from
   polygenic risk scores, our approach integrates these detailed monogenic
   clusters with PRS. This integration aims to enhance the risk
   stratification for individuals, potentially improving the
   identification process for those at higher risk. Therefore, the present
   study aimed to stratify breast and prostate cancer risk by analyzing
   454,711 exome sequencing and 487,409 array samples from the UK Biobank.
   The genes selected for analysis were derived from studies on both
   cancer types to evaluate the ability of our method to prioritize causal
   monogenic risk variants specific to each cancer^[58]5,[59]25–[60]27. We
   prioritized genes specific to each cancer type, clustering monogenic
   variant genes based on their respective odds ratios and
   population-attributable fractions (PAF)^[61]28. Combining each gene
   cluster with the PRS risk groups partitioned the samples into more
   finely stratified risk groups. The results revealed that high-PRS samples with
   high-risk gene clusters had a prevalence of 0.59 for breast cancer and
   0.46 for prostate cancer. Furthermore, the odds ratios showed a 15-fold
   increased risk for breast cancer and a 22-fold increased risk for
   prostate cancer compared to intermediate-PRS groups without rare
   variants for each respective cancer. We have termed this approach
   expanded PRS (EPRS) and discussed the application of this framework for
   estimating individual risks of breast and prostate cancer, highlighting
   its potential adaptability to other phenotypes for risk stratification
   based on monogenic and polygenic risk effects.

Results

Demographic information of UK Biobank cohort individuals for each cancer type
and genetic risk factors

   From the UK Biobank cohort, which includes 487,409 individuals with
   genotype data and 454,711 with exome data, we selected 121,918 female
   and 108,956 male samples for breast and prostate cancer analyses,
   respectively. The female samples had a mean age of 55.6 years, while
   the male samples had a mean age of 55.72 years. We identified 12,643
   breast and 8753 prostate cancer cases. For both cancer types, the
   remaining individuals with no other cancer diagnosis or self-reported
   cancer were classified as controls. Within the female and male control
   groups, 2856 (2.34%) and 2559 (2.35%) samples harbored pathogenic
   variants, respectively (Supplementary Table [62]1).

Initial validation of PRS in cancer risk stratification

   We first confirmed that a higher PRS for each cancer was associated
   with higher risk by comparing equal-sized tertiles of the PRS groups
   (low, intermediate, and high risk). Odds ratios for each group were
   calculated via logistic regression, using the intermediate-PRS group as
   a reference. For breast cancer, the odds ratio of the low-PRS group was
   0.53, whereas the high-PRS group had an odds ratio of 1.97 (P = 2.70 ×
   10^−118 and 7.92 × 10^−207, respectively). The odds ratios for
   prostate cancer followed a similar trend, with values of 0.52 for the
   low-PRS group and 6.02 for the high-PRS group (P = 3.60 × 10^−45 and
   0.00E+00, respectively) (Supplementary Fig. [63]1, Supplementary
   Table [64]2). These results support the hypothesis that PRS effectively
   stratifies disease risk. We also assessed the variance of the trait
   explained by PRS using R^2. The R^2 values on the liability scale were
   0.099 and 0.344 for breast and prostate cancer, respectively
   (Supplementary Table [65]9).
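
   The tertile comparison described above can be sketched as follows.
   This is a minimal illustration rather than the study's pipeline: the
   function names (assign_prs_group, odds_ratio_wald) and all counts are
   hypothetical, and the odds ratio is the unadjusted cross-product
   estimate with a Wald interval, which is what a logistic regression
   returns when the group indicator is its only predictor. Any covariate
   adjustment used in the study is not reproduced here.

```python
import math


def assign_prs_group(scores, tail_fraction=1 / 3):
    """Label each score 'low', 'intermediate' or 'high' by rank.

    tail_fraction is the share of samples placed in each of the low and
    high groups; 1/3 gives equal-sized tertiles (hypothetical helper).
    """
    n = len(scores)
    n_tail = round(n * tail_fraction)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = ["intermediate"] * n
    for i in order[:n_tail]:
        labels[i] = "low"
    for i in order[n - n_tail:]:
        labels[i] = "high"
    return labels


def odds_ratio_wald(cases, controls, ref_cases, ref_controls, z=1.96):
    """Unadjusted odds ratio versus a reference group, with a Wald 95% CI."""
    odds_ratio = (cases * ref_controls) / (controls * ref_cases)
    se = math.sqrt(1 / cases + 1 / controls + 1 / ref_cases + 1 / ref_controls)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only (not UK Biobank data):
# a high-PRS group compared with the intermediate-PRS reference group.
or_high, ci_low, ci_high = odds_ratio_wald(200, 800, 100, 900)
```

   Passing a different tail_fraction (for example 0.2) gives unequal
   low, intermediate and high groups instead of tertiles.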

Monogenic variant carriers show elevated risk across PRS groups

   The intermediate-PRS group without any monogenic variants was selected
   as the reference group for computing odds ratios. We compared the odds
   ratios of cancer risk for each PRS group—both with and without
   monogenic variants—with the intermediate-PRS group devoid of monogenic
   variants. The odds ratios of the PRS groups without monogenic variants
   were consistent with those of PRS alone, where monogenic variants were
   not considered (Fig. [66]1, Supplementary Fig. [67]1).

Fig. 1. Cancer odds ratio among individuals categorized according to the
presence of monogenic variants and polygenic risk scores (PRS).


   Participants with or without monogenic variants were categorized into
   three strata based on their PRS (low, intermediate, or high). The odds
   ratio and 95% confidence intervals were calculated from a logistic
   regression model for (a) breast and (b) prostate cancer.

   In all three stratified PRS groups, those with monogenic rare
   pathogenic variants, annotated by ClinVar guidelines, AlphaMissense,
   and PrimateAI 3D within both breast and prostate cancer candidate gene
   regions, exhibited an increased risk. The risk of carrier samples
   showed increased odds ratios across all three PRS groups for breast
   cancer: 1.16, 2.03, and 4.69 for low, intermediate, and high,
   respectively. This trend was also observed in prostate cancer, with
   odds ratios of 0.62, 1.75, and 10.13 for low, intermediate, and high,
   respectively. For both cancer types, the highest odds ratio was
   observed in the high-PRS group with monogenic variants, and the lowest
   odds ratio was observed in the low-PRS group without monogenic
   variants. Notably, the odds ratio of the high-PRS group without
   monogenic variants was higher than that of the low-PRS group with
   monogenic variants for breast cancer (Supplementary Table [70]3, Fig. [71]1).

Genes prioritized and clustered based on odds ratios and PAF

   In our analysis of the 23 candidate genes from both breast and prostate
   cancers, we prioritized known cancer-specific causal genes based on
   their odds ratios and PAFs (Supplementary Table [72]4). Density-based
   spatial clustering of applications with noise (DBSCAN) was used for
   clustering the monogenic effect of prioritized genes. For breast
   cancer, BRCA1, BRCA2, PALB2, CHEK2, and ATM were prioritized from the
   23 genes. Among these, BRCA1 exhibited the highest odds ratio at 9.46
   (P = 5.46 × 10^−20). The PAF for BRCA2 was the highest at 0.0053. The
   pathogenic variant count was highest for CHEK2, with 457 occurrences
   (Table [73]1). We clustered the five breast cancer-specific genes based
   on their odds ratios and PAFs, resulting in two distinct groups
   (Supplementary Fig. [74]2a).

Table 1.

   Odds ratios and population-attributable fractions of genes for each
   cancer. Odds ratio and PAFs of genes were computed and used for
   clustering monogenic effects for each cancer

   Cancer           Gene    Carrier (n)  OR (95% CI)        P value   PAF     Monogenic effect cluster
   Breast cancer    BRCA1   69           9.46 (5.85–15.31)  5.46E−20  0.0024  2
   Breast cancer    BRCA2   216          5.43 (4.12–7.15)   2.11E−33  0.0053  2
   Breast cancer    PALB2   182          3.27 (2.37–4.52)   6.29E−13  0.0027  1
   Breast cancer    CHEK2   457          1.66 (1.3–2.1)     3.82E−05  0.0027  1
   Breast cancer    ATM     370          2.52 (1.98–3.2)    6.34E−14  0.0039  1
   Prostate cancer  HOXB13  398          3.12 (2.47–3.94)   1.78E−21  0.0072  2
   Prostate cancer  BRCA2   208          1.93 (1.33–2.78)   4.76E−04  0.0021  1
   Prostate cancer  ATM     314          2.17 (1.62–2.89)   1.51E−07  0.0038  1

   PAF population-attributable fraction, OR odds ratio

   For prostate cancer, HOXB13, BRCA2, and ATM were prioritized
   (Table [76]1). Of these, HOXB13 showed the highest odds ratio at 3.12
   (P = 1.78 × 10^−21), a PAF of 0.0072, and a pathogenic carrier count of
   398. The three genes were differentiated into two groups (Supplementary
   Fig. [77]2b).
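
   For readers unfamiliar with DBSCAN, the clustering step can be
   illustrated with a minimal standard-library sketch. It runs on toy
   two-dimensional points standing in for already-scaled (odds ratio,
   PAF) pairs; the points, the eps radius and the min_samples value are
   all hypothetical, because this section does not state the feature
   scaling or parameters that were used. It is not intended to reproduce
   the cluster assignments in Table 1.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one label per point (-1 marks noise)."""
    n = len(points)

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Toy, already-scaled feature pairs (hypothetical values, not Table 1 data).
toy_points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
              (3.0, 3.0), (3.2, 2.9),
              (10.0, 10.0)]
toy_labels = dbscan(toy_points, eps=0.5, min_samples=2)
```

   With these toy inputs the first three points form one cluster, the
   next two form a second, and the isolated point is labelled as noise.
   Because DBSCAN is distance-based, the result depends strongly on how
   the two features are scaled relative to each other.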

Cancer risk according to stratified risk groups

   Separate analyses were conducted for each gene cluster, with two
   clusters each for breast and prostate cancer. The odds ratios for
   cancer across groups were calculated using the intermediate-PRS group
   without monogenic variants as a reference. Within each gene cluster,
   and among samples without variants, odds ratios were highest in the
   high-PRS risk group, followed by the intermediate- and low-PRS risk
   groups (Table [78]2).

Table 2.

   Odds ratio of risk groups stratified using monogenic effect clusters
   and polygenic risk scores for breast and prostate cancer

   Cancer           Monogenic effect cluster  PRS risk      OR (95% CI)          P-value    Sample (n)
   Breast cancer    No variant                Low           0.52 (0.49–0.55)     7.88E−115  39,664
   Breast cancer    No variant                Intermediate  1 (0.95–1.05)        0.00E+00   39,682
   Breast cancer    No variant                High          1.96 (1.87–2.05)     1.98E−196  39,716
   Breast cancer    1                         Low           1.49 (1.1–2.02)      1.09E−02   362
   Breast cancer    1                         Intermediate  2.12 (1.6–2.82)      1.84E−07   334
   Breast cancer    1                         High          5.71 (4.49–7.27)     1.02E−45   311
   Breast cancer    2                         Low           5.71 (3.71–8.78)     2.30E−15   95
   Breast cancer    2                         Intermediate  9 (6.01–13.49)       1.79E−26   101
   Breast cancer    2                         High          15.18 (9.79–23.52)   5.04E−34   89
   Prostate cancer  No variant                Low           0.52 (0.48–0.58)     2.65E−42   35,542
   Prostate cancer  No variant                Intermediate  1 (0.93–1.08)        0.00E+00   35,485
   Prostate cancer  No variant                High          6.02 (5.65–6.41)     0.00E+00   35,370
   Prostate cancer  1                         Low           1.38 (0.69–2.79)     3.65E−01   165
   Prostate cancer  1                         Intermediate  2.35 (1.25–4.42)     7.97E−03   142
   Prostate cancer  1                         High          15.58 (11.36–21.36)  3.41E−65   215
   Prostate cancer  2                         Low           0.44 (0.11–1.82)     2.60E−01   100
   Prostate cancer  2                         Intermediate  3.55 (2.03–6.19)     8.34E−06   130
   Prostate cancer  2                         High          22.41 (16.08–31.24)  2.77E−75   168

   OR odds ratio, PRS polygenic risk score

   In breast cancer, gene cluster 1 demonstrated odds ratios of 1.49,
   2.12, and 5.71 for low-, intermediate-, and high-PRS risk groups
   (P = 1.09 × 10^−2, 1.84 × 10^−7 and 1.02 × 10^−45), respectively.
   The odds ratio of gene cluster 1 for the low-PRS risk group was lower
   than that of the no-variant high-PRS risk group (1.49 vs. 1.96)
   (Table [80]2). The odds ratios of the low-, intermediate- and high-PRS
   risk groups of gene cluster 2 significantly increased to 5.71, 9.00 and
   15.18 (P = 2.30 × 10^−15, 1.79 × 10^−26 and 5.04 × 10^−34),
   respectively, with 15.18 being the highest among all risk groups
   (Fig. [81]2a, Table [82]2).

Fig. 2. Cancer odds ratio among individuals categorized according to
monogenic effect clusters and polygenic risk score (PRS).


   Participants were categorized into three strata based on their PRS
   (low, intermediate, or high) and monogenic effect cluster. The odds
   ratios and 95% confidence intervals (CIs) were calculated from a
   logistic regression model for (a) breast and (b) prostate cancer.

   For prostate cancer, the odds ratios for low-, intermediate-, and
   high-PRS risk groups in gene cluster 1 were 1.38, 2.35, and 15.58,
   respectively. The odds ratio for the low-PRS risk group in gene cluster
   1 was lower than that of the no-variant high-PRS risk group (1.38 vs.
   6.02). Additionally, the odds ratios for the low-, intermediate-, and
   high-PRS risk groups, when associated with gene cluster 2, were 0.44,
   3.55, and 22.41, respectively; the intermediate- and high-PRS
   estimates were significant, whereas the low-PRS estimate was not
   (P = 0.26). The low-PRS risk
   group in gene cluster 2 had the lowest odds ratio among all risk groups
   at 0.44, which was lower than that for the low-PRS risk group without
   variants. Notably, there was a significant increase in odds ratios
   within the high-PRS risk group across every monogenic effect cluster
   (Fig. [85]2b, Table [86]2).

Validation of the EPRS model

   To substantiate the robustness of our EPRS model, we conducted a
   validation using 5-fold cross-validation within the UK Biobank dataset.
   We classified samples based on their PRS group and whether they carried
   pathogenic variants in specific gene clusters. Logistic regression
   models were then fitted to the training data, comprising genetic data
   and cancer status. This process tested the EPRS across three distinct PRS groups
   and two gene clusters, employing the Area Under the Curve (AUC) metric
   to gauge predictive accuracy. The resulting mean AUC values were 0.632
   for breast cancer and 0.734 for prostate cancer, affirming the model’s
   efficacy in stratifying cancer risk.
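
   The cross-validated AUC can be sketched as follows, under stated
   assumptions. Each sample is scored by the case fraction of its EPRS
   stratum in the training folds, which stands in for a logistic
   regression whose only predictors are stratum indicators; the study's
   actual model specification may differ. The function names, the
   round-robin fold assignment and the toy data are hypothetical, and a
   real analysis would shuffle or stratify the folds.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(strata, labels, k=5):
    """Mean AUC over k folds using per-stratum training case fractions."""
    n = len(labels)
    fold_aucs = []
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        totals, cases = {}, {}
        for i in train:
            totals[strata[i]] = totals.get(strata[i], 0) + 1
            cases[strata[i]] = cases.get(strata[i], 0) + labels[i]
        # Fall back to the overall training case fraction for unseen strata.
        overall = sum(cases.values()) / len(train)
        scores = [cases[strata[i]] / totals[strata[i]]
                  if strata[i] in totals else overall
                  for i in test]
        fold_aucs.append(auc(scores, [labels[i] for i in test]))
    return sum(fold_aucs) / k


# Toy data (hypothetical): 20 samples in three strata.
toy_strata = ["low"] * 5 + ["mid"] * 10 + ["high"] * 5
toy_labels = [0] * 5 + [0] * 5 + [1] * 5 + [1] * 5
toy_mean_auc = cross_validated_auc(toy_strata, toy_labels, k=5)
```

   Because the strata are coarse, many case-control pairs share the same
   score; the half-credit for ties in auc keeps the estimate well defined
   in that situation.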

Prevalence of cancer risk stratified by EPRS

   We estimated the stratified cancer risk by EPRS, considering both PRS
   risk groups and gene clusters, and assessed their influence on the
   prevalence of each cancer. The prevalence was calculated as the
   proportion of individuals within the cohort who had either a prior
   diagnosis of cancer or developed cancer during the study period. This
   allowed us to capture the overall burden of disease, including both
   pre-existing and incident cases, to assess the ability of EPRS to
   stratify risk across a comprehensive population. The mean prevalence of
   each gene within a gene cluster per PRS risk group was calculated to
   demonstrate the combined effect of PRS risk and gene clusters for risk
   stratification.
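
   One reading of the calculation described above is sketched here:
   prevalence is first computed per gene within each PRS risk group, and
   the per-gene values are then averaged within each gene cluster. The
   function name, record layout and toy records are hypothetical; only
   the gene-to-cluster mapping is taken from Table 1.

```python
from collections import defaultdict
from statistics import mean


def cluster_prevalence(records, gene_to_cluster):
    """Mean of per-gene prevalences for each (cluster, PRS group).

    records: iterable of (gene, prs_group, is_case) tuples, one per carrier.
    """
    totals = defaultdict(int)
    cases = defaultdict(int)
    for gene, prs_group, is_case in records:
        totals[(gene, prs_group)] += 1
        cases[(gene, prs_group)] += int(is_case)

    per_cluster = defaultdict(list)
    for (gene, prs_group), n in totals.items():
        key = (gene_to_cluster[gene], prs_group)
        per_cluster[key].append(cases[(gene, prs_group)] / n)
    return {key: mean(values) for key, values in per_cluster.items()}


# Breast cancer cluster labels as listed in Table 1.
breast_clusters = {"BRCA1": 2, "BRCA2": 2, "PALB2": 1, "CHEK2": 1, "ATM": 1}

# Toy carrier records (hypothetical, not UK Biobank data).
toy_records = [
    ("BRCA1", "high", 1), ("BRCA1", "high", 1), ("BRCA1", "high", 0),
    ("BRCA2", "high", 1), ("BRCA2", "high", 0),
    ("ATM", "low", 0), ("ATM", "low", 1), ("CHEK2", "low", 0),
]
toy_prevalence = cluster_prevalence(toy_records, breast_clusters)
```

   This reading appears consistent with the reported figures: averaging
   the high-PRS prevalences given in the Discussion for BRCA1 (0.67) and
   BRCA2 (0.52) gives about 0.595, close to the 0.59 reported for the
   high-PRS group in gene cluster 2.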

   Among the breast cancer risk groups, the high-PRS risk group in gene
   cluster 2 had the highest prevalence at 0.59. The low-PRS risk group
   without any pathogenic variants had the lowest prevalence at 0.053
   (Supplementary Table [87]5, Fig. [88]3a). A consistent increase in cancer prevalence
   was observed as both PRS risk and monogenic effect clusters increased.

Fig. 3. Prevalence of risk groups.


   The prevalence of stratified risk groups according to their monogenic
   effect cluster and polygenic risk score (PRS) risk group was
   calculated for UK Biobank samples for (a) breast and (b) prostate
   cancer.

   Among the prostate cancer risk groups, the high-PRS group in gene
   cluster 2, which includes only the HOXB13 gene, had the highest
   prevalence at 0.46. The lowest prevalence was observed in the low-PRS risk group in
   gene cluster 2 at 0.020, followed by the low-PRS group without any
   pathogenic variants at 0.024 (Supplementary Table [91]6, Fig. [92]3b).

   The prevalence of the groups constructed by EPRS increased with
   increasing PRS, in the order of low-, intermediate-, and high-risk groups.
   Additionally, a sequential increase was noted in gene clusters for
   breast cancer. In prostate cancer, the prevalence in the high-PRS group
   without any pathogenic variants surpassed those in the intermediate-PRS
   groups with both gene clusters 1 and 2.

Discussion

   Disease risk can be stratified according to PRS in conjunction with
   monogenic variants in high-risk genes^[93]11,[94]13,[95]21. Using the
   EPRS approach, we systematically categorized monogenic variants by
   clustering risk genes using odds ratios and PAF values and then
   assessed the extent to which PRS influences each cluster in breast and
   prostate cancers. Through EPRS, we were able to observe the
   contributions of monogenic and polygenic effects on cancer risk,
   improving the understanding of the genetic profile influencing cancer
   risk.

   PRS demonstrated significance in stratifying both breast and prostate
   cancer risk. The odds ratios of the low- and high-PRS groups for both
   cancers significantly differed from that of the intermediate-PRS group
   in both analyses. These findings demonstrate that the cumulative effect
   of SNPs increases the risk of cancer, indicating that PRS alone can be
   utilized to stratify the risk of an individual. We used three different
   summary statistics for each cancer to construct PRS and applied
   PRSice2, LDpred2, and SBayesR. The performance of PRS was evaluated
   using R^2 on the liability scale, and the best-performing methods were
   selected: LDpred2 using summary statistics from Zhang et al. ^[96]29
   for breast cancer, and PRSice2 using summary statistics from Wang et
   al. ^[97]30 for prostate cancer (Supplementary
   Table [98]9)^[99]29,[100]30.

   In addition, by incorporating the monogenic variant effect, we also
   observed increased cancer risk in each PRS group with pathogenic
   variants compared with that of PRS groups without any pathogenic
   variants. Pathogenic variants increase cancer incidence by interrupting
   metabolic pathways^[101]13,[102]16. In our study, cancer risk varied by
   PRS group and the presence of variants. Notably, the intermediate-PRS
   group with variants exhibited a significantly increased cancer risk
   compared with the high-PRS group without variants. Moreover, the
   high-PRS group with variants for both cancers displayed the highest
   risk among all groups. Samples with monogenic variants in each PRS risk
   group demonstrated an up to 2-fold higher risk than those without
   monogenic variants for both cancers. Interestingly, the low-PRS group
   with monogenic variants slightly exceeded the risk of the
   intermediate-PRS group without variants for breast cancer (odds ratio
   1.16) but not for prostate cancer (odds ratio 0.62). Although these
   odds ratios were not significant, these findings suggest that monogenic
   effects can amplify the risk in samples with low polygenic effects.

   We observed stratified risks in each group, depending on the absence or
   presence of monogenic variants. However, given that the impact of the
   presence or absence of monogenic variants can have a considerably more
   critical effect on risk than SNPs, applying the summation of genetic
   effects, often used for PRS construction, to monogenic variants may not
   fully represent the genetic risk of disease. Furthermore, the risk
   associated with monogenic variants can vary depending on the specific
   gene hosting that variant. Previous studies have primarily focused on
   the elevated cancer risks associated with well-known risk-increasing
   genes in conjunction with PRS^[103]13,[104]22,[105]23. Notably, our
   EPRS approach provides a systematic method for prioritizing and
   clustering monogenic effects and integrating them with PRS, thereby
   refining cancer risk stratification. In our EPRS approach, we
   prioritized genes specific to each type of cancer and clustered the
   monogenic effects based on their odds ratios and PAFs. By selecting
   genes whose odds ratio P-values were below the Bonferroni-corrected
   threshold of 0.0022 and whose PAF values were greater than 0, we were
   able to highlight the genes most significantly affecting each cancer
   type. This approach aligned with
   previous studies that identified genes, such as ATM, BRCA1, BRCA2,
   CHEK2, and PALB2 as associated with an increased risk of breast
   cancer^[106]26,[107]31, and HOXB13 and BRCA2 with prostate
   cancer^[108]5,[109]27. We then clustered the identified genes to
   estimate their associated cancer risk. In breast cancer, we identified
   five genes grouped into two distinct monogenic effect clusters.
   Monogenic effect clusters 1 and 2 showed moderate- and high-risk
   effects, respectively. In prostate cancer, we identified three genes
   clustered into two groups. Monogenic effect clusters 1 and 2
   demonstrated moderate- and high-risk effects on prostate cancer.
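
   The prioritization filter described in this paragraph can be sketched
   as below. The 0.0022 cut-off is consistent with dividing a family-wise
   error rate of 0.05 by the 23 candidate genes, but that derivation is
   an assumption here rather than something this section states. The P
   values and PAFs are those listed for breast cancer in Table 1; the
   GENE_X entry and the function name are hypothetical and are included
   only to show a gene being filtered out.

```python
ALPHA = 0.05        # assumed family-wise error rate
N_CANDIDATES = 23   # candidate genes analyzed, as stated in the Results

p_threshold = ALPHA / N_CANDIDATES  # about 0.0022

# (P value, PAF) per gene; Table 1 values plus one hypothetical gene.
breast_genes = {
    "BRCA1": (5.46e-20, 0.0024),
    "BRCA2": (2.11e-33, 0.0053),
    "PALB2": (6.29e-13, 0.0027),
    "CHEK2": (3.82e-05, 0.0027),
    "ATM": (6.34e-14, 0.0039),
    "GENE_X": (1.0e-02, 0.0010),  # hypothetical: fails the P-value cut-off
}


def prioritize(genes, threshold):
    """Keep genes with P below the threshold and a positive PAF."""
    return sorted(
        name for name, (p_value, paf) in genes.items()
        if p_value < threshold and paf > 0
    )


prioritized = prioritize(breast_genes, p_threshold)
```

   All five breast cancer genes in Table 1 pass this filter, with CHEK2
   (P = 3.82E−05) the closest to the threshold.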

   Additionally, we incorporated these cluster effects within PRS groups,
   facilitating a more detailed subdivision of risk stratification and its
   characteristics. When combined with PRS, all monogenic clusters
   increased the risk for both breast and prostate cancers. However, the
   degree of increase varied depending on the monogenic cluster effects,
   which were revealed through the odds ratios of each risk group. In
   breast cancer, monogenic effect clusters 1 and 2 demonstrated moderate-
   and high-risk effects, respectively, when combined with PRS. A total of
   1292 samples were newly classified into different groups compared with
   those stratified by PRS risk alone. The odds ratios demonstrated a
   concurrent elevation in breast cancer risk influenced by both polygenic
   and monogenic effects. Samples in gene cluster 2 within the low-PRS
   risk category exhibited a higher odds ratio than even the high-PRS risk
   group without variants. High-PRS risk samples in clusters 1 and 2
   displayed higher odds ratios than the high-PRS risk group with
   unclustered variants. This detailed stratification of monogenic effect
   clusters allowed us to observe more specific risk differences. Applying
   the EPRS approach to prostate cancer resulted in 817 samples being
   reclassified into different risk groups compared to those obtained
   using PRS alone. Similar to breast cancer, monogenic clusters 1 and 2
   in prostate cancer demonstrated moderate and high risks, respectively,
   when combined with PRS. Despite this, the polygenic effects were more
   pronounced than the monogenic effects in our analysis. PRS showed
   robust predictive performance with an R² of 0.344 on the liability
   scale, presenting a steep increase in odds ratios through PRS risk
   levels. The highest and lowest odds ratios were observed in the high-
   and low-PRS risk groups of gene cluster 2, respectively. Therefore, our
   analysis enhanced the distinction of cancer risk groups beyond the
   scope of PRS alone by incorporating monogenic effect clusters.

   Our findings highlight that the combined effect of PRS and monogenic
   clusters can substantially influence cancer risk. This was also evident
   in the observed prevalence of cancer among risk groups. For breast
   cancer, monogenic cluster 2 demonstrated a higher prevalence across all
   PRS groups; specifically, the high-PRS group had more than half of the
   samples in the risk group diagnosed with breast cancer. BRCA1 and
   BRCA2, which were identified as critical genetic variants causing
   breast cancer^[110]26,[111]32–[112]35, were clustered in monogenic
   effect cluster 2. Specifically, in the high-PRS group, both genes
   exhibited high prevalence values: 0.67 for BRCA1 and 0.52 for BRCA2
   (Supplementary Table [113]7). The prevalence of prostate cancer among
   risk groups also varied according to their PRS group and monogenic
   cluster effect, with a trend of increasing prevalence as the PRS risk
   group ascended. Monogenic effect cluster 2 exclusively contained
   HOXB13, a causative gene of prostate cancer^[114]23,[115]27. This gene
   demonstrated a higher prevalence than that of BRCA2 and ATM in cluster
   1 for prostate cancer (Supplementary Table [116]8).

   In breast cancer, pathway enrichment analysis revealed significant
   involvement in DNA damage response and repair mechanisms. Both clusters
   shared enrichment in critical pathways such as DNA double-strand break
   repair, cellular response to DNA damage, and cell cycle checkpoints,
   underscoring their collective role in maintaining genomic stability.
   Key gene ontology biological processes associated with these pathways
   include double-strand break repair, DNA repair, and signal transduction
   in response to DNA damage. Despite their common roles, each cluster
   also exhibited unique pathway enrichments. BRCA1 and BRCA2 of cluster 2
   were uniquely involved in pathways like the ATR-BRCA pathway and
   homology-directed repair, emphasizing their roles in precise DNA repair
   through homologous recombination and apoptotic signaling. Conversely,
   ATM, PALB2, and CHEK2 of cluster 1 were enriched in pathways related to
   diseases of DNA repair and response to ionizing radiation, highlighting
   their roles in signaling and repair processes under stress conditions.

   For the pathway enrichment analysis of prostate cancer, two gene
   clusters emerged: one involving HOXB13 and another comprising ATM and
   BRCA2. Both clusters shared involvement in developmental and
   differentiation pathways, such as gland development, reproductive
   system development, and cellular growth. However, each cluster also
   exhibited unique pathway enrichments, reflecting their distinct
   functions. ATM and BRCA2 of cluster 1 are enriched in DNA repair
   and damage response pathways, including homologous recombination, the
   ATR-BRCA pathway, and the DNA repair complex, emphasizing their
   critical roles in genomic stability and preventing mutation
   propagation. In contrast, HOXB13 is uniquely enriched in pathways
   related to cellular growth and maturation, indicating its pivotal role
   in development and differentiation.

   This study has some limitations. First, different partitioning criteria
   for PRS could have potentially shown better performance in cancer risk
   prediction compared to equal-sized tertiles. Various partitioning
   criteria, ranging from 5% to 35% in increments of 5% for both top and
   bottom percentages, were applied to our EPRS approach. The mean AUC was
   calculated using 5-fold cross-validation, and the 30% partitioning
   criteria yielded the best performance for breast cancer, while 20% was
   optimal for prostate cancer (Supplementary Table [117]10). This
   approach effectively segregated high- and low-PRS risk groups. The
   highest and lowest prevalence of risk groups remained the same, with
   high-PRS and gene cluster 2 showing the highest risk, and low-PRS with
   no monogenic variant showing the lowest risk. The prevalence increased
   and decreased accordingly in these groups (Supplementary Fig. [118]3).
   However, the optimal partitioning criteria varied for different cancer
   types. Although we explored various segregation percentages, there may
   still be room for improvement as PRS is a continuous variable. Future
   applications of different segregation methods may enhance the
   understanding of polygenic effects. Nevertheless, the current study
   primarily focused on the potential of systematic risk stratification
   using genetic profiles. The second limitation is a potential bias in
   selecting genes for systematic prioritization and monogenic effect
   clustering. The challenge of considering all functional genes is
   substantial. Given their shared causal genes, we focused on breast and
   prostate cancers to lessen the computational and financial burdens of
   analysis. Candidate genes specific to each cancer were selected based
   on previous studies^[119]5,[120]25–[121]27. Although all pathogenic
   variants were accounted for regardless of the type of candidate cancer
   gene, each cancer-specific gene was prioritized, and cancer-specific
   monogenic clusters showed an elevated effect on cancer risk. However,
   further research into additional diseases is necessary for more precise
   systematic risk stratification. The third limitation of this study
   arises from the recruitment bias of UK Biobank. This cohort
   predominantly consists of participants who are older, more educated,
   and of European ancestry. Moreover, these participants generally
   exhibit healthier lifestyles and a lower prevalence of several health
   conditions compared to the general UK population. They are notably less
   likely to be obese, smoke, or consume alcohol daily. These
   characteristics suggest a ‘healthy volunteer’ bias, which may affect
   the generalizability of our findings, including the calculated
   Population Attributable Fractions (PAFs) and the observed cancer
   prevalence. To overcome these limitations, further studies should
   consider incorporating a broader population^[122]36.

   In summary, this study aimed to systematically stratify the risk of
   cancers by clustering genes with pathogenic variants based on odds
   ratios and PAF, which are used to infer risk levels elevated by rare
   variants. We addressed this using sequencing data from the UK Biobank.
   Our findings suggest that, for breast cancer, relying solely on odds
   ratios, the most commonly used measure, may be insufficient to fully
   capture the risk contribution of certain genes. In our study, the
   well-known genes BRCA2
   and PALB2 were similar in terms of odds ratios but differed in PAF,
   leading to their placement in different clusters. These clusters
   displayed distinguishable patterns in terms of breast cancer
   prevalence. Moreover, when combined with a polygenic risk score based
   on common variants, data regarding individuals with rare variants in
   BRCA1 and BRCA2 showed stratified patterns of cancer prevalence
   depending on PRS level. Similar findings were observed for prostate
   cancer. Therefore, we suggest that when considering risk stratification
   for cancer, it is beneficial to focus on both rare and common variant
   information, incorporating metrics such as PAF in addition to odds
   ratios for the estimation of rare variant gene effects.

   However, we acknowledge certain limitations in our approach. Gene
   selection was based on a literature review, which may introduce bias
   and potentially miss significant genes. Furthermore, the pathogenicity
   assessment of rare variants and gene clustering methods relied on
   subjective thresholds. The optimal PRS grouping thresholds varied
   between traits, suggesting that some may be trait-specific. While more
   quantitative and statistically rigorous methods, such as gene-based
   common variant scores combined with rare-variant burden tests, could
   provide a more robust framework for stratification, these methods
   require a much larger set of genes and significantly more data. To
   apply such methods effectively, future studies will require larger
   sample sizes and more comprehensive datasets, such as those obtained
   through whole genome sequencing (WGS). Incorporating WGS data would
   allow for the inclusion of a broader spectrum of variants, thereby
   providing a more detailed understanding of the roles of low-frequency,
   rare, and somatic variants in cancer risk. Such enhancements would
   significantly improve the precision and accuracy of systematic risk
   stratification.

Patients and methods

Data source

   This study utilized genetic and phenotypic data from the UK Biobank
   (application ID 72128), a large-scale health cohort study designed to
   provide robust statistical power for various analyses. Data collection
   was conducted from 2006 to 2010, encompassing more than 500,000
   participants aged between 40 and 69 years from multiple assessment
   centers, primarily in England, Scotland, and Wales. The UK Biobank
   performed high-quality genome-wide genotyping and genotype imputation,
   leveraging a reference panel from the Haplotype Reference
   Consortium^[123]37. Despite the overall healthier status of the cohort,
   lower prevalence of obesity, and reduced incidence of smoking or
   alcohol consumption, it is considered a representative sample of the
   white British population in the United Kingdom^[124]36. All ethical
   regulations relevant to human research participants were followed.

Study participants

   This study identified cancer cases according to the International
   Classification of Diseases (ICD)-10 code in data field 41270, ICD-9
   code in data field 41271, or self-reported cancer code in data field
   20001. For breast cancer, female participants with ICD-10 code C50X,
   ICD-9 code 174X, or self-reported cancer code 1002 were categorized as
   cases. Similarly, male participants with ICD-10 code C61, ICD-9 code
   185X, or self-reported cancer code 1044 were classified as prostate
   cancer cases. Participants without any other cancer diagnosis or
   self-reported cancer were assigned as controls.
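
   The case and control rules above can be sketched as a small
   classifier. This is an illustrative sketch, not the pipeline used in
   the study: the function name, the dictionary layout, and the
   "female"/"male" sex labels are assumptions, and the three code
   collections are assumed to contain cancer-related codes only, so that
   a participant with no codes at all counts as a control.

```python
# Code values are taken from the text; names and layout are hypothetical.
CASE_DEFINITIONS = {
    "breast": {"sex": "female", "icd10": "C50", "icd9": "174", "self_report": 1002},
    "prostate": {"sex": "male", "icd10": "C61", "icd9": "185", "self_report": 1044},
}


def classify(cancer, sex, icd10_codes, icd9_codes, self_reported_codes):
    """Return "case", "control", or "excluded" for one participant.

    Assumption: the three collections hold cancer-related codes only
    (ICD codes as strings, self-reported codes as integers).
    """
    rule = CASE_DEFINITIONS[cancer]
    if sex != rule["sex"]:
        return "excluded"  # each analysis is restricted to one sex
    is_case = (
        any(code.startswith(rule["icd10"]) for code in icd10_codes)
        or any(code.startswith(rule["icd9"]) for code in icd9_codes)
        or rule["self_report"] in self_reported_codes
    )
    if is_case:
        return "case"
    if not icd10_codes and not icd9_codes and not self_reported_codes:
        return "control"  # no cancer diagnosis and no self-reported cancer
    return "excluded"  # a different cancer is recorded
```

   Prefix matching is used here so that C50X and 174X cover every
   subcode; whether the study matched codes this way is not stated.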

Sample and genotype quality control

   To analyze individuals with relatively homogeneous ancestry and owing
   to the small percentages of non-British individuals, the present
   analysis was restricted to white British ancestry individuals.
   Genetically confirmed ancestry was used to identify this subgroup,
   utilizing principal components in data field 22020. We excluded
   samples with putative sex chromosome aneuploidy (data field 22019),
   that is, samples potentially carrying sex chromosome configurations
   other than the typical XX or XY; samples with genetic kinship to other
   participants (data field 22021); and participants who withdrew
   informed consent, as identified centrally. Our study
   initially involved 454,711 exome sequencing samples and 487,409 array
   samples. Following quality control of the samples with imputed genotype
   data, 333,990 samples remained. The intersection of these datasets
   resulted in a total of 311,225 samples that were utilized for the
   analyses (Supplementary Fig. [125]4). In addition, variants
   demonstrating an imputation quality score below 0.7, minor allele
   frequency under 0.01, or missing genotype rate exceeding 0.05, were
   excluded. After these quality control measures were applied to imputed
   genotype data, a total of 8,700,879 variants remained for our analyses
   (Supplementary Fig. [126]5). The quality control processes were
   executed using PLINK version 2.0.
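
   To make the boundary conditions of the variant filters explicit, the
   three thresholds can be written as a single predicate. This is a
   minimal sketch for illustration only; the study applied these filters
   with PLINK, and the function and argument names below are
   hypothetical.

```python
def passes_variant_qc(info_score, allele_freq, missing_rate):
    """Return True if a variant survives the three filters in the text.

    Variants are excluded when the imputation quality score is below 0.7,
    the minor allele frequency is under 0.01, or the missing genotype
    rate exceeds 0.05, so values exactly at a threshold are kept.
    """
    # The minor allele is whichever allele is rarer.
    maf = min(allele_freq, 1.0 - allele_freq)
    return info_score >= 0.7 and maf >= 0.01 and missing_rate <= 0.05
```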

Calculating polygenic risk scores from array data

   The PRS were calculated using three distinct tools: PRSice2 (version
   2.3.3)^[127]38, LDpred2^[128]39, and SBayesR^[129]40, each offering
   unique methodologies to enhance the accuracy and applicability of PRS.
   PRSice2 clumps single nucleotide polymorphisms (SNPs) based on
   linkage disequilibrium (LD) and P-value, followed by P-value
   thresholding, using the default clumping options (--clump-kb 250kb,
   --clump-p 1, and --clump-r2 0.1). LDpred2 utilizes a point-normal
   prior for SNP effect sizes and employs a Markov chain Monte Carlo
   (MCMC) procedure to infer posterior mean effect sizes. SBayesR
   performs Bayesian posterior inference to accommodate SNPs with small,
   medium, and large effects, thus allowing for more general effect size
   distributions^[130]41. For
   each cancer type, summary statistics were derived from genome-wide
   association studies^[131]19,[132]29,[133]30,[134]42–[135]44.

   Among these tools and summary statistics, we selected the most
   effective combination for each cancer based on R^2 performance on the
   liability scale (Supplementary Table [136]9). Specifically, we used
   LDpred2 with Zhang et al.^[137]29 for breast cancer and PRSice2 with
   Wang et al.^[138]30 for prostate cancer.

Selection of candidate genes for identifying rare variants

   Based on previous research findings, we identified 23 candidate genes
   to investigate pathogenic rare variants for breast and prostate cancer.
   For breast cancer, we selected the susceptibility genes frequently
   identified in sequencing panels and genes housing protein-truncating
   variants associated with overall breast cancer^[141]25,[142]26. For
   prostate cancer, we initially used the gene list known to heighten the
   risk of prostate cancer due to germline variants^[143]5. In addition,
   we included the overlapping genes from the list associated with
   hereditary prostate cancer^[144]27.

Identification of pathogenic variants

   Pathogenic variants were identified using the dx command-line client of
   the DNAnexus Platform SDK, which allowed us to download WES variant
   call format (VCF) files from the UKB Research Analysis Platform (RAP).
   We selected WES population VCF (pVCF) files based on the location of
   candidate genes, creating target regions of the genes using bedtools
   version 2.30.0 of the app-swiss-army-knife in the dx client. We
   downloaded each resulting region-based VCF file via RAP DNAnexus.
   Annotation of VCF files was achieved through an in-house pipeline, with
   individual variants annotated using SnpEff (5.0e) and dbNSFP (4.2c)
   software^[145]45,[146]46. The criteria for pathogenic variant selection
   were as follows: 1) variants deemed pathogenic or likely pathogenic by
   ClinVar; 2) variants with a ClinVar review status of two stars or
   higher; and 3) variants classified as pathogenic or likely pathogenic
   according to ClinVar, provided the variant was consistently categorized
   as such or had more than five pathogenic or likely pathogenic
   annotations without any benign/likely benign classification. We used
   the ClinVar database updated as of June 3, 2024. Additionally, we
   incorporated information from AlphaMissense^[147]47 and
   PrimateAI^[148]48 to predict variant pathogenicity. Both algorithms
   utilize primate variant population frequency databases to predict the
   pathogenicity of missense variants. AlphaMissense adapts AlphaFold
   fine-tuned on human data, while PrimateAI employs deep neural networks.
   Both algorithms provide pathogenicity scores ranging from 0 to 1 and
   classify variants based on specific thresholds. In AlphaMissense,
   variants with a pathogenicity value over 0.9 are considered
   pathogenic^[149]49 while PrimateAI classifies pathogenic variants as
   ‘deleterious’. Only variants meeting both criteria were classified as
   pathogenic and included in our analysis.

Prioritization and clustering of monogenic effects

   To prioritize the genes for specific cancer types, we considered all 23
   candidate genes for each cancer. We identified samples carrying
   pathogenic variants within these genes, treating the presence of a
   pathogenic variant as an interruption of the gene’s function. Thus, any
   individual harboring a pathogenic variant was considered a carrier.
   With the cancer phenotype, we calculated the odds ratio and PAF for
   each gene. For each gene, we computed the odds ratios using a logistic
   regression model, controlling for age at recruitment and the first four
   principal components (PCs). Simultaneously, we computed the PAF based
   on the estimated relative risk associated with the presence of a
   monogenic variant and the prevalence of the variant among
   cases^[150]28. We calculated the PAF for each gene to determine the
   fraction of each cancer attributable to rare variants in that gene.
   Genes with odds ratio P-values below the Bonferroni-corrected
   threshold of 0.0022 (0.05/23) and PAF values greater than 0 were
   selected. This approach ensured the identification of genes exerting
   a significant influence on each cancer type. Subsequently, we grouped
   these prioritized genes to estimate their collective cancer risk,
   employing DBSCAN. This method used both the log10 of the odds ratio
   and the PAF for each gene, which adjusts the scale of the two values.
   In our approach, this clustering
   synthesizes multiple rare variants within a gene into a single
   monogenic effect, facilitating a comprehensive analysis of their
   combined impact on cancer risk.
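
   The prioritization and clustering steps can be sketched as follows.
   This is a sketch under stated assumptions, not the study's code. The
   PAF is computed with the standard case-based (Miettinen) formula,
   PAF = p_c (RR - 1) / RR, which matches the description above but is
   not spelled out in the text, and the odds ratio stands in for the
   relative risk. The study used scikit-learn; the small plain-Python
   DBSCAN below is a stand-in so the example has no dependencies. The
   gene names, the toy values, and the eps and min_samples settings are
   all hypothetical.

```python
import math
from collections import deque

P_THRESHOLD = 0.0022  # Bonferroni threshold quoted in the text


def paf_from_cases(carrier_fraction_in_cases, relative_risk):
    """Case-based PAF: p_c * (RR - 1) / RR (assumed formula)."""
    return carrier_fraction_in_cases * (relative_risk - 1.0) / relative_risk


def prioritize(genes):
    """Keep genes with P < threshold and PAF > 0 as (name, log10 OR, PAF)."""
    kept = []
    for g in genes:
        # Assumption: the odds ratio approximates the relative risk.
        paf = paf_from_cases(g["carrier_fraction_in_cases"], g["odds_ratio"])
        if g["p_value"] < P_THRESHOLD and paf > 0:
            kept.append((g["name"], math.log10(g["odds_ratio"]), paf))
    return kept


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN on 2-D points; returns cluster labels, -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means not yet visited

    def neighbors(i):
        return [j for j in range(n)
                if math.hypot(points[i][0] - points[j][0],
                              points[i][1] - points[j][1]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)  # includes i itself
        if len(seeds) < min_samples:
            labels[i] = -1  # noise unless later reached from a core point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_samples:  # j is a core point: keep expanding
                queue.extend(reach)
    return labels


# Hypothetical toy input, for illustration only.
TOY_GENES = [
    {"name": "GENE_A", "odds_ratio": 10.0, "p_value": 1e-10, "carrier_fraction_in_cases": 0.020},
    {"name": "GENE_B", "odds_ratio": 9.0, "p_value": 1e-8, "carrier_fraction_in_cases": 0.018},
    {"name": "GENE_C", "odds_ratio": 2.0, "p_value": 1e-4, "carrier_fraction_in_cases": 0.004},
    {"name": "GENE_D", "odds_ratio": 2.2, "p_value": 5e-4, "carrier_fraction_in_cases": 0.005},
    {"name": "GENE_E", "odds_ratio": 1.5, "p_value": 0.2, "carrier_fraction_in_cases": 0.010},
    {"name": "GENE_F", "odds_ratio": 0.5, "p_value": 1e-5, "carrier_fraction_in_cases": 0.010},
]
```

   With these toy values, GENE_E fails the P-value filter and GENE_F has
   a negative PAF, so only four genes reach the clustering step. Calling
   dbscan on their (log10 OR, PAF) pairs with eps=0.1 and min_samples=2
   groups GENE_A with GENE_B and GENE_C with GENE_D.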

Validation of the EPRS model

   We conducted 5-fold cross-validation within the UK Biobank dataset.
   This process involved testing the EPRS across various partitioning
   criteria to identify the optimal PRS group thresholds for each cancer
   type (Supplementary Table [151]10).

   First, we partitioned the PRS into three distinct groups based on
   specific quantile thresholds. The top and bottom quantiles tested were
   5%, 10%, 15%, 20%, 25%, 30%, 33.3%, and 35%. This approach allowed us
   to determine the most effective thresholds for stratifying individuals
   into low, intermediate, and high-PRS groups.
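
   The partitioning step can be illustrated with a short rank-based
   sketch. It assumes that the same fraction is cut from the top and the
   bottom of the PRS distribution, as the text describes; the function
   name, the handling of ties (by sort order), and the rounding of group
   sizes are assumptions made for illustration.

```python
# Tail fractions tested in the text.
TAIL_FRACTIONS = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 1 / 3, 0.35]


def assign_prs_groups(scores, tail_fraction):
    """Label each score "low", "intermediate", or "high".

    The lowest and highest `tail_fraction` of individuals form the low-
    and high-PRS groups; everyone else is intermediate.
    """
    if not 0 < tail_fraction < 0.5:
        raise ValueError("tail_fraction must lie strictly between 0 and 0.5")
    n = len(scores)
    k = round(n * tail_fraction)  # individuals per tail (assumed rounding)
    order = sorted(range(n), key=lambda i: scores[i])  # indices, ascending PRS
    groups = ["intermediate"] * n
    for i in order[:k]:
        groups[i] = "low"
    for i in order[n - k:]:
        groups[i] = "high"
    return groups
```

   Looping over TAIL_FRACTIONS and refitting the model for each
   partition would reproduce the threshold search described above.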

   Next, we merged these PRS groups with phenotypic and genetic cluster
   data. Samples were classified based on their PRS group and whether they
   carried pathogenic variants in specific gene clusters. Logistic
   regression models were then fitted to this data, adjusting for the
   effects of age and the first four principal components.

   We performed 5-fold cross-validation to evaluate the model’s
   performance. The dataset was split into five subsets, where the model
   was trained on four subsets and tested on the remaining one, rotating
   through all subsets. We calculated the Area Under the Curve (AUC)
   metric to gauge predictive performance.
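
   The fold rotation and the AUC calculation can be sketched in plain
   Python. Note the substitution: the study fitted logistic regression
   models adjusted for age and principal components, whereas this sketch
   scores each test individual by the case prevalence of their risk
   group in the training folds, purely to keep the example short and
   dependency-free. All function names are hypothetical, and the sketch
   assumes every test fold contains both cases and controls.

```python
import random
from statistics import mean


def k_fold_indices(n, k=5, seed=0):
    """Split range(n) into k disjoint folds; seed=None keeps input order."""
    idx = list(range(n))
    if seed is not None:
        random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def auc(labels, scores):
    """Probability that a random case outscores a random control.

    Ties count as one half. Requires at least one case and one control.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(groups, labels, k=5, seed=0):
    """Mean AUC over k folds using per-group training prevalence as score."""
    n = len(labels)
    fold_aucs = []
    for test in k_fold_indices(n, k, seed):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        # "Fit": count cases and members per risk group in the training folds.
        counts = {}
        for i in train:
            tally = counts.setdefault(groups[i], [0, 0])
            tally[0] += labels[i]
            tally[1] += 1
        overall = sum(labels[i] for i in train) / len(train)
        prevalence = {g: cases / size for g, (cases, size) in counts.items()}
        # "Predict": unseen groups fall back to the overall prevalence.
        scores = [prevalence.get(groups[i], overall) for i in test]
        fold_aucs.append(auc([labels[i] for i in test], scores))
    return mean(fold_aucs)
```

   Here groups would hold one label per individual combining PRS group
   and gene cluster, for example ("high", "cluster 2").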

Statistics and reproducibility

   We stratified individuals by PRS tertile and by the presence or
   absence of pathogenic variants, dividing them into three equal-sized
   PRS groups (low, intermediate, and high). The intermediate-PRS group was
   used as a reference for computing the odds ratios to assess cancer
   prevalence in the population. Initial calculations of odds ratios were
   performed for the three PRS groups, demonstrating risk stratification
   using PRS only. Our analysis considered all pathogenic variants in
   candidate genes, using the intermediate-PRS group without any
   pathogenic variants as a reference. For the monogenic cluster, we
   computed the odds ratio for each gene harboring pathogenic variants and
   the PAF for each gene, indicating the fraction of cancer attributable
   to the interrupted gene. Using these two measures, we classified the
   effect of monogenic variants through DBSCAN. The reference group was
   also used to compute the odds ratio for the specific effect of the
   monogenic cluster on cancer. For each group, we used a logistic
   regression model adjusted for age at recruitment and the first four PCs
   to compute the odds ratio. To substantiate the robustness of our EPRS
   model, we validated it using 5-fold cross-validation within the UK
   Biobank dataset, with the mean AUC as the measure of predictive
   performance. All statistical analyses were conducted using
   Python version 3.7.9, with modules statsmodels 0.13.2 and scikit-learn
   1.1.1. All plots were created using matplotlib 3.5.2 and seaborn
   0.11.2.

Reporting summary

   Further information on research design is available in the [152]Nature
   Portfolio Reporting Summary linked to this article.

Acknowledgements