Abstract

   The rate and pattern of mutagenesis in cancer genomes is significantly
   influenced by DNA accessibility and active biological processes. Here
   we show that efficient sites of replication initiation drive and
   modulate specific mutational processes in cancer. Sites of replication
   initiation impede nucleotide excision repair in melanoma and are
   off-targets for activation-induced deaminase (AICDA) activity in
   lymphomas. Using ductal pancreatic adenocarcinoma as a cancer model, we
   demonstrate that the initiation of DNA synthesis is error-prone at
   G-quadruplex-forming sequences in tumours displaying markers of
   replication stress, resulting in a previously recognised but
   uncharacterised mutational signature. Finally, we demonstrate that
   replication origins serve as hotspots for genomic rearrangements,
   including structural and copy number variations. These findings reveal
   replication origins as functional determinants of tumour biology and
   demonstrate that replication initiation both passively and actively
   drives focal mutagenesis in cancer genomes.

   Subject terms: Cancer genomics, Mutation, Computational biology and
   bioinformatics
     __________________________________________________________________

   The rate and pattern of mutagenesis in cancer genomes are shaped by DNA
   accessibility and biological processes. Here the authors show that
   efficient replication initiation sites drive cancer specific mutational
   processes, increasing the frequency of small mutations and genomic
   rearrangements at these sites.

Introduction

   Cancer cells are characterised by their ability to sustain cell growth
   and proliferation by evading suppressor signals and activating oncogene
   expression^[28]1. The overexpression of oncogenes such as c-Myc, h-RAS,
   or Cyclin E shortens the G1 phase of the cell cycle, expedites S phase
   entry, and increases levels of replication initiation^[29]2–[30]4. This
   accelerated S-phase entry is linked to replication stress and genome
   instability, in part due to origin over-usage and activation of dormant
   replication origins^[31]5. However, the impact of replication
   initiation on mutagenesis in human cancers has not been examined.

   We recently reported that a subset of highly efficient replication
   origins, conserved across a range of untransformed, immortalised and
   transformed human cell lines^[32]6, act as hotspots for mutagenesis in
   the germline^[33]7. These ‘constitutive’ replication origins are
   notably enriched within the active and early replicated regions of the
   human genome, particularly within gene bodies. The mutational processes
   operating at these sites thus contribute to elevated mutational burdens
   at gene promoters and splice junctions, impacting phenotypic diversity
   across human populations. Genomic regions in which replication
   initiates constitute fragile sites associated with DNA breaks and
   translocations in cancer^[34]4,[35]8, and origin activity has also been
   shown to drive translocations at the MYC locus in murine B cells,
   suggesting a potential link between replication origins and the
   development of lymphomas and other cancers^[36]9. Further, increased
   origin firing contributes to replication stress driven by oncogene
   activation and is linked to DNA breaks and rearrangements^[37]4 while
   overexpression of origin firing factors such as GINS1 and CDC45 in
   human colorectal cancer cells leads to aneuploidy^[38]10. Collectively,
   these studies highlight replication origin firing as a cancer-relevant
   trigger for genome instability.

   In this study, we report the identification and characterisation of
   mutational processes occurring at constitutive origins in cancer. Our
   results emphasise the major impact of replication initiation in shaping
   the mutational landscape of the human genome, adding replication
   origins to the expanding repertoire of genomic features influencing
   local mutation rates in cancer (reviewed in ref. ^[39]11).

Results

Pan-cancer analysis of mutation distribution reveals enrichment at
replication origins

   To systematically examine the frequency of somatic point mutations at
   constitutive origins, we considered mutation calls from the
   International Cancer Genome Consortium (ICGC) aggregated from 86 cancer
   projects. To evaluate a potential enrichment of single nucleotide
   variants, hereafter refer to as mutations, at constitutive origins
   compared to baseline levels, we first examined ‘isolated’ origins
   (refer to the Methods section for detailed information on the analysed
   origin sites) and found a ~1.7-fold enrichment at these sites compared
   to their flanks (Supplementary Fig. [40]1a). This was accompanied by an
   increase in minor allele frequencies (Supplementary Fig. [41]1b). We
   confirmed that constitutive origins located within origin clusters
   exhibit a similar mutational burden (Supplementary Fig. [42]1c, [43]d)
   supporting the use of ‘isolated’ origins to avoid potential biases
   introduced by clustering when estimating local changes in mutation
   rates. To account for local variation in base composition, we corrected
   the apparent mutation rates of each of the six pyrimidine mutations
   (and hence also reciprocally mutation rates at purines) by their
   occurrences and found increased mutation rates associated to each type
   (Fig. [44]1a). When adjusted for trinucleotide frequency, we observed
   increased mutation rates for all mutations except C > T (Supplementary
   Fig. [45]1e). However, we found that C > T transitions are favoured in
   a non-CpG context (Supplementary Fig. [46]1f), suggesting that
   constitutive origins protect against spontaneous CpG deamination, as
   observed in the germline^[47]7. These observations suggest that a
   specific form, or forms, of mutagenesis is focused on constitutive
   origins in cancer genomes. We note that mutation rates are negligible
   at ‘core’ origins, a class of cell type-independent origins (see the
   Methods section for detailed information)^[48]12, which do not overlap
   with constitutive origins, (Supplementary Fig. [49]1g). This suggests
   that constitutive origins represent the most recurrent sites of
   replication initiation in cancer genomes, accumulating a significant
   mutational imprint.

Fig. 1. Pan-cancer analysis of mutation distribution across the genome
reveals enrichment at constitutive origins.

   Fig. 1
   [50]Open in a new tab

   a Mutation rates associated with the six pyrimidine substitutions at
   constitutive origins. Mutation rates were computed from aggregated
   mutation calls from 86 ICGC cancer projects, corrected for local
   variation in base composition and background values. b Averaged and
   background adjusted mutation rates stratified by cancer types. Acronyms
   of cancer types are detailed in the Methods section. c Distribution of
   origins / origin flanks mutation density ratios of individual cancer
   samples stratified by cancer type and primary sites. Density ratios
   were computed by considering origin domains (origin midpoints ± 500 bp)
   and origin flanking domains (origin midpoints ± 10 kb excluding origin
   domains). Box plots show medians and interquartile ranges. Individual
   cancer samples are shown as grey dots. Only whole-genome-sequenced
   cancer samples with at least 5,000 mutations were considered for
   experiments reported in (b, c).

   We then asked whether mutation rates at constitutive origins vary
   across cancer types. We computed origin mutation rates for individual
   tumours with over 5000 mutations (1056 tumours across 20 cancer types)
   to ensure reliable estimates of mutation density at and near origins.
   While some cancers, like gall bladder cancer, show high mutation rates
   at origins (Fig. [51]1b), others, like renal or liver cancers,
   unexpectedly have constant or even negative rates. To assess variation
   within cancer types, we computed the ratios of mutations at origins
   compared to flanking regions and assessed their variability across
   cancer types (Fig. [52]1c). This revealed substantial differences
   within specific cancer types, pancreatic cancer, for instance, showing
   ratios ranging from 0.21 to 5.14. These observations suggest that
   increased mutation rates at origins are driven by mutational processes
   of varying degrees of intensity rather than inherent tumour properties.
   Supporting this, we found a strong correlation between the total number
   of mutations and those at origins (Rho = 0.954, Supplementary
   Fig. [53]1h), but an opposite correlation with the origin/flank
   mutation ratio (Rho = −0.246, Supplementary Fig. [54]1i). This pattern
   suggests that specific mutational processes operate at constitutive
   origins.

   To determine whether these mutational processes are cancer-specific or
   simply reflect origin usage in somatic tissues, we investigated
   mutation rates at constitutive origins in normal tissues. First, we
   examined mutation rates at origins in non-dividing cells across various
   tissues^[55]13 and found a mutational burden at constitutive origins
   similar to that of their flanks (Supplementary Fig. [56]2a), with no
   significant increase in mutation rates for any of the six pyrimidine
   mutations (Supplementary Fig. [57]2b). This suggests that mutagenesis
   at origins does not occur in the absence of replication. We then
   examined the distribution of mutations at constitutive origins in 18
   somatic cell types^[58]14 and observed an accumulation of SNVs at these
   sites (Supplementary Fig. [59]2c). However, after adjusting for local
   variations in base composition, the mutation rates for the six
   pyrimidines did not rise above background levels (Supplementary
   Fig. [60]2d,[61]e). This suggests that the apparent accumulation of
   mutations at origins in non-cancerous replicating tissues reflects
   local variation in base composition rather than active mutational
   processes. Overall, these findings support the existence of
   cancer-specific mutational processes targeting constitutive origins.

Origin-associated mutational signatures

   To identify and characterise these mutational processes, we clustered
   cancer samples based on the cosine similarity of their origin
   trinucleotide mutation signature (Fig. [62]2a). We chose a clustering
   approach over the application of non-negative matrix factorisation, as
   it enables the grouping of mutational patterns and exposures,
   facilitating the identification of mutational patterns that are present
   genome-wide but more prevalent at replication origins. Our analysis
   specifically focused on tumours with over 50 mutations at origins (227
   tumours across 17 cancer types) to ensure sufficient power to call
   distinct signatures. Each origin signature underwent adjustment for
   local trinucleotide composition and correction to account for
   background values from the flanking domains (details in Methods). The
   clustering revealed five distinct tumour groups, with two clusters
   exclusive to a specific cancer type and three including multiple types
   (Fig. [63]2b). Subsequently, representative mutation signatures were
   computed by aggregating mutations at origins from tumours within each
   cluster (Fig. [64]2c). These were compared to known signatures of
   somatic mutations in cancerous human tissues (COSMIC SBS signatures,
   Supplementary Fig. [65]3a)^[66]15. Further deconvolution of cluster 5,
   which visually appeared to contain several potential subclusters, did
   not reveal meaningful tumour groups (Supplementary Fig. [67]3b,[68]c).

Fig. 2. Signature analysis uncovers distinct mutagenic processes focused at
constitutive origins.

   [69]Fig. 2
   [70]Open in a new tab

   a Cancer sample clustering based on the cosine similarity of their
   origin trinucleotide mutational signature. Each signature at origin
   domains was corrected to adjust for local trinucleotide composition and
   background values from neighbouring origin flanking domains. Only
   whole-genome-sequenced cancer samples with at least 50 mutations at
   origins were considered. b Distribution of cancer types and primary
   sites within the five identified clusters of tumours. c
   Origin-associated mutational signatures for tumour clusters. Signatures
   were computed from aggregated mutation calls at origin domains for
   clustered cancer samples and corrected as previously. d Local exposure
   to mutational signatures associated with clustered tumours. Signature
   contributions are corrected for background origin flank values. e Total
   mutation count at origins and (f) origin/origin flanks mutation density
   ratios for individual cancer samples grouped by clusters. Box plot show
   medians and interquartile ranges. Individual cancer samples are
   represented as grey dots.

   Clusters 1 and 4 are specific to skin melanomas (MELA) and B cell
   malignant lymphomas (MALY) respectively, displaying characteristic
   origin signatures highly similar to the known signatures SBS7b (cosine
   similarity = 0.83, exposure to ultraviolet light) and SBS84 (cosine
   similarity = 0.88, activity of activation-induced cytidine deaminase
   (AICDA or AID)). Cluster 2 encompasses tumours from various cancer
   types, with pancreatic ductal adenocarcinoma (PACA), oesophageal
   adenocarcinoma (ESAD), and breast cancer (BRCA) being the primary
   contributors. The cluster 2 signature exhibits high similarity to SBS43
   (cosine similarity = 0.95), a signature of unknown aetiology, marked by
   T > G mutations in the specific context GTG. Clusters 3 and 5 comprise
   samples from over 10 different cancer types. The cluster 3 signature
   does not share similarities with any reported COSMIC signatures.
   Lastly, the aggregated cluster 5 signature appears featureless,
   resembling a ‘flat’ signature probably resulting from merging signals
   originating from diverse processes. It is noteworthy that clusters 3
   and 5 are the least resolved clusters in our analysis (Fig. [71]2a). We
   then compared these origin-associated signatures with genome-wide
   signatures derived from samples within the respective clusters
   (Supplementary Fig. [72]3d). Aside from cluster 1, our
   origin-associated signatures exhibit minimal resemblance to the
   corresponding genome-wide signatures. This finding validates our
   approach for identifying processes that focus mutagenesis at origins.

   To gain a deeper understanding of the specificity of our
   origin-associated signatures, we examined their local contribution near
   origins (Fig. [73]2d). The aggregate origin signatures derived for
   clusters 3 and 5 show no discernible contribution to origin
   mutagenesis. Conversely, processes linked to cluster 2 and 4 signatures
   focus mutations at origins, with a substantial 7- to 9-fold increase
   compared to origin flanks. The process related to the cluster 1
   signature exhibits some preference for origins, albeit with a moderate
   1.3-fold enrichment. We further quantified the number of mutations
   attributed to each signature at origins (Fig. [74]2e) and evaluated the
   origin specificity of associated processes (Fig. [75]2f) at the level
   of individual cancer samples. These analyses reveal that while the
   process at origins within cluster 1 lacks specificity for origins, it
   is enriched at origins and contributes significantly to origin
   mutagenesis in the MELA project samples. Conversely, clusters 2 and 4
   are highly specific to origins but contribute only moderately in terms
   of absolute mutation numbers. We validated these findings by fitting
   COSMIC SBS signatures to mutational profiles of origin domains
   (Supplementary Fig. [76]3e). Overall, our analysis identifies three
   independent mutational processes targeting mutagenesis at constitutive
   origins, two of which are tissue-specific, while one operates across
   different cancer types. Next, our aim was to identify the aetiology of
   these processes, beginning with those specific to particular tissues.

Differential DNA repair underlies mutation hotspots at origins in skin
melanoma

   Samples from cluster 1 exclusively originate from skin melanomas and
   exhibit both origin and genome-wide signatures that closely resembles
   the known COSMIC SBS7b signature (Fig. [77]2b, c and Supplementary
   Fig. [78]3a), which is linked to UV light exposure. However, the
   absolute contribution of SBS7b is ~2.5-fold higher at origins compared
   to their flanks (Supplementary Fig. [79]3e). This could be explained by
   either locally impaired DNA repair or a higher probability of
   UV-induced lesions at constitutive origins. Given that both protein
   binding and transcription initiation have been reported to impair
   nucleotide excision repair (NER)^[80]16,[81]17, we hypothesised that
   increased mutagenesis at origins may, by analogy, result from local NER
   deficiency due to the loading of the pre-replication complex at
   licensed origins.

   To assess this point, we used genome-wide maps of NER activity,
   obtained through XR-seq analysis of cyclobutane pyrimidine dimers (CPD)
   and pyrimidine-pyrimidone (6–4) photoproducts (6–4 PP) in irradiated
   skin fibroblasts^[82]18. We correlated the XR-seq profiles with DNase I
   hypersensitivity (DHS) at origins (Supplementary Fig. [83]4a) and
   observed a positive correlation between NER activity and origin
   accessibility (Rho = 0.40 and 0.17 for CPD and 6–4 PP respectively,
   Supplementary Fig. [84]4b) suggesting increased accessibility is
   associated with higher levels of damage. However, we found that the
   profile of mutation density at origins inversely mirrors NER activity
   (Fig. [85]3a and Supplementary Fig. [86]4c) and that the number of
   mutations at origins displayed a negative correlation with XR-seq
   signals at these sites (Rho = −0.44 and −0.36 for CPD and 6–4 PP
   respectively, Fig. [87]3b). Thus, replication initiation may
   specifically impair NER explaining the increased in SBS7b mutagenesis
   at origins in melanoma.

Fig. 3. Differential DNA repair and off-target AID deamination underlie
mutation hotspots at origins.

   [88]Fig. 3
   [89]Open in a new tab

   Nucleotide excision repair in ultraviolet-irradiated human cells
   inversely mirrors mutation density at origins in skin melanomas. a
   Average melanoma mutation density for cluster 1 tumour samples (red
   line) and strand-resolved XR-seq profiles for CPD (blue lines) in
   CSB/ERCC6 mutant NHF1 skin fibroblasts at constitutive origins. b
   Number of mutations at origins as a function of XR-seq signals for CPD
   (blue line) or 6–4 PP (red line). XR-seq signals were binned by read
   coverage. Mutation data represent the means and standard errors to the
   mean of values for each XR-seq signal bin. c Strand-resolved cluster 1
   mutational signature contribution to mutagenesis at the origins of
   cluster 1 tumours. Signature contribution was computed by considering
   aggregated mutation calls from cluster 1 tumours. Constitutive origins
   are off-targets for AID deamination in malignant B cell lymphomas. d
   Consensus contexts of mutations mapped within origin or origin flank
   domains for cluster 4 tumours. The central position represents the
   mutated bases. e Fraction of C > T mutations (orange line) or C
   residues (black line) overlapping the WRCY AID hotspot motif (where W
   represents weak bases, R purines, C the mutated bases and Y
   pyrimidines) at origins of cluster 4 tumours. f Pan-cancer
   origins/origin flanks mutation density ratios as a function of AID
   (AICDA) expression. Mutation density ratio values were computed for
   transcript per million (TPM) bins and represent the means and standard
   errors of the mean.

   To control for other known processes that may impair NER at
   constitutive origins, such as the binding of transcription factors
   (Supplementary Fig. [90]4d, [91]e), ref. ^[92]16 we compared NER
   activity at origins with or without experimentally validated binding
   sites of transcription factors, such as CTCF or ETS1. We found that the
   local NER deficiency observed at origins does not depend on
   transcription factor occupancy at these sites (Supplementary
   Fig. [93]4f). It worth noting that we performed this analysis using
   data from CSB/ERCC6 mutants to interrogate global NER rather
   transcription-coupled NER. We further ruled out the involvement of
   transcription-coupled NER at origins by observing that XR-seq profiles
   at intergenic origins were comparable to those observed for origins
   located at promoters or within gene bodies (Supplementary Fig. [94]3g).

   Interestingly, when examining strand-resolved XR-seq signals
   (Fig. [95]3a and Supplementary Fig. [96]4c), we observed a preference
   for NER activity on the lagging strand near constitutive origins. To
   determine whether this observation arises from a replication strand
   bias or from a skew in UV-reactive dinucleotides (such as TT
   dinucleotides), we adjusted XR-seq signals for dinucleotide composition
   skew (details in Methods) and calculated strand biases (Supplementary
   Fig. [97]3h). Despite the correction, NER activity still exhibited a
   bias toward the lagging strand. Subsequently, we evaluated the strand
   specificity of our cluster 1 origin-associated signature (Fig. [98]3c)
   and found no strand bias at the origins but an enhanced contribution on
   the lagging strand in close proximity to the origin. Taken together,
   these findings support NER deficiency at origins and increased NER on
   the lagging strand of replication forks translocating away from
   origins.

Origins are off-targets of cytosine deaminases in B cell lymphomas

   The cluster 4 signature (Fig. [99]2c) exhibits notable specificity for
   origins (Supplementary Fig. [100]2b) and bears significant similarity
   to the established COSMIC SBS84 signature associated with AID activity
   in lymphoid cancers. We hypothesised that origins might serve as
   off-targets for AID deamination. To investigate this point, we first
   examined the extended context of cluster 4 mutations at origins or
   their flanks (Supplementary Fig. [101]4i). Our analysis revealed that
   C > T mutations at origins occur within specific extended contexts,
   distinct from those observed at origin flanks, characterised by an
   enrichment of As, Gs, and Cs at the −2, −1, and +1 positions
   respectively relative to the C > T mutations. This mutation context
   closely resembles the known WRCY AID hotspot motif (where W represents
   weak bases, R purines, C the mutated base and Y
   pyrimidines)^[102]19,[103]20. Subsequently, by constructing consensus
   sequences for C > T mutations at origins and origin flanks
   (Fig. [104]3d), we confirmed that mutagenesis at origins in Cluster 4
   genomes aligns with this consensus sequence. Furthermore, to
   demonstrate the activity of AID at origins in lymphomas, we calculated
   the fraction of C > T mutations within the canonical WRCY motif around
   origins (Fig. [105]3e) and observed a local increase at origins.
   Interestingly, the fraction of Cs in a similar context decreased at
   origins, supporting that origins are indeed off-targets of AID. To
   reinforce this observation, we identified a pan-cancer correlation
   between mutational burden at origins and AID expression (Rho = 0.25,
   Fig. [106]3f).

   Another source of deamination in lymphomas and other cancer genomes
   arises from the activity of APOBEC deaminases^[107]21,[108]22. We
   therefore sought to determine whether APOBEC activity contributes to
   mutagenesis at origins. To investigate this, we initially examined
   whether high expression of APOBEC in lymphomas is associated with
   increased mutational burden at origins (Supplementary Fig. [109]4j).
   Our analysis revealed that the overexpression of APOBEC3 enzymes is
   indeed associated with higher origin/flank mutation density ratios,
   suggesting a potential contribution of APOBEC3 to origin mutagenesis.
   However, when assessing the background-adjusted origin mutation
   profiles computed for lymphoma genomes, we observed minimal
   contribution from the COSMIC SBS2 and SBS13 signatures, which are
   attributed to the activity of the APOBEC family of cytidine deaminases
   (Supplementary Fig. [110]4k). This observation suggests that cytosine
   deamination at origins in B cell germinal centre lymphomas is primarily
   driven by AID.

Polymerase δ behaviour at non-B DNA structures drives origin mutagenesis

   Cluster 2 samples are derived from a wide range of cancer types, with
   focal mutagenesis at origins primarily driven by T > G mutations (see
   Fig. [111]2c and Supplementary Fig. [112]5a). We note that this
   signature is independently observed across all cancer types within
   cluster 2 tumours (Supplementary Fig. [113]5b), suggesting a common
   mechanism that is not tissue-specific. Investigating this distinctive
   signature, we first examined the extended contexts of these mutations
   at origins and their flanks (Fig. [114]4a). Interestingly, T > G
   mutations at origins, unlike those within origin flanks, tend to occur
   within G-rich sequences, particularly embedded within G tracts. This
   suggests a potential association with non-B DNA structures, such as
   G-quadruplexes (G4s), known to hinder replication and induce
   mutagenesis and genetic instability^[115]23,[116]24. To probe the role
   of G4s in mutagenesis at origins, we analysed short sequences
   overlapping with T > G mutations at origins and their flanks (25 bp
   around mutations), predicting their susceptibility to fold into G4
   structures using the G4Hunter algorithm, which considers G-richness and
   G-skewness to generate a quadruplex propensity score (G4H
   score)^[117]25. Our analysis revealed that sequences surrounding T > G
   mutations at origins in cluster 2 genomes exhibit significantly higher
   overall positive G4H scores compared to sequences containing T > G
   mutations in origin flanks or sequences containing any T residues at
   origins (Fig. [118]4b). This indicates a potential driving role of G4
   structures in mutagenesis at origins. We next identified sequences with
   the potential for G4 formation following a
   N[5]G[3+]N[1–12]G[3+]N[1–12]G[3+]N[1–12]G[3+]N[5] pattern (where N is
   any base)^[119]26 within origin domains and computed their G4H scores.
   Remarkably, sequences bearing T > G mutations display higher G4H scores
   than those without mutations (Fig. [120]4c), suggesting that G4
   structures, rather than solely G-rich sequences, are driving
   mutagenesis, with the most stable predicted G4s being the most mutated.
   We confirmed this by examining mutagenesis at G4 structures
   experimentally mapped in vivo using data generated by the G4access
   method^[121]27. Experimentally validated G4 structures mutated at
   origins are predicted to be more stable (Supplementary Fig. [122]5c)
   and have shorter loops (Supplementary Fig. [123]5d, [124]e) than those
   that are not mutated.

Fig. 4. Polymerase δ behaviour at G-quadruplex structures drives origin
mutagenesis in diverse cancer types.

   [125]Fig. 4
   [126]Open in a new tab

   a River plot illustrating the sequence contexts of T > G mutations at
   origins or origin flanks for cluster 2 tumour samples. b Distribution
   of the G-quadruplex (G4) propensity scores, G4H scores, associated with
   short 51 bp sequences encompassing T > G substitutions at origins and
   origin flanks, or any T residues at origins. c Distribution of G4H
   scores for all or T > G mutated sequences at origin domains, fitting
   the pattern N[5]G[3+]N[1–12]G[3+]N[1–12]G[3+]N[1–12]G[3+]N[5] (where N
   is any base), representing potential G4 formation. Box plot report
   medians and interquartile ranges. Outliers are shown as grey dots.
   ***P < 0.001, Kolmogorov-Smirnov test. d Distribution at
   single-nucleotide resolution of mutations and G frequency at G4-forming
   sequences found within 1 kb domains centred on origins. The start of
   the G4s is defined as the first G from the first G tract of the
   previous pattern. G4-forming sequences were oriented toward the
   direction of replication. e Absolute contribution of COSMIC signatures
   SBS10b (blue, left hand Y-axis) and SBS10c (red, right hand Y-axis) to
   mutagenesis at origins, associated with the activity of POLE and POLD1,
   respectively. Both signatures are attributed to defective proofreading
   due to acquired mutations in the exonuclease domains of the
   polymerases.

   Having established the role of G4s in driving mutagenesis at origins,
   we delved deeper into understanding the characteristics of our cluster
   2 origin signature. We hypothesised that T > G mutations in the
   specific GTG context (Fig. [127]2c) might be disproportionately
   represented in G4-forming sequences. We analysed the occurrence of each
   of the 96 trinucleotide contexts within the established G4 origin
   pattern and revealed an enrichment of the GTG context within G4 forming
   sequences at origins (Supplementary Fig. [128]5f). Accounting for the
   trinucleotide composition of G4s enabled us to extract a mutational
   signature specific to G4s (Supplementary Fig. [129]5g), illustrating
   the diverse spectrum of mutations observed at these sites. In this
   context, we note that G4s contribute on average ~33%, and up to 72%, to
   mutagenesis at the origins of cluster 2 tumours (Supplementary
   Fig. [130]5h,[131]i).

   Furthermore, we examined the mutation pattern at origin G4s by
   precisely mapping mutations at G4 sites oriented by the direction of
   replication (Fig. [132]4d). Our findings indicate an accumulation of
   mutations at the first nucleotides before and after the initial G tract
   of G4s, aligning with biochemical observations^[133]28,[134]29,
   suggesting that polymerases may stall or proceed with errors past the
   first G tract of G4s during polymerase extension. This supports that
   mutagenesis at origins in cluster 2 is primarily a result of polymerase
   stalling at G4s. However, we observed that our cluster 2 signature does
   not exhibit a replication strand bias at origins (Supplementary
   Fig. [135]5j), and that G4s at origins are not predicted to be
   inherently more stable than G4s at their flanks (Supplementary
   Fig. [136]5k). This suggests that a single replicative polymerase may
   be responsible for mutagenesis at origins, and its synthesis errors are
   confined within origin domains.

   Evidence, from both biochemical studies in yeast and genetic
   investigations in humans, supports the theory that following the
   priming of DNA replication by polymerase α, polymerase δ initiates both
   leading and lagging strand synthesis, thus performing the bulk of DNA
   synthesis near replication origins^[137]7,[138]30–[139]32. This initial
   DNA synthesis following origin firing occurs uncoupled from helicase
   activity and is likely more susceptible to polymerase stalling,
   potentially accounting for the increased mutation rate observed at G4s
   near origins. Consequently, we evaluated the exposure of origins to
   mutational signatures associated with the activity of POLD1 and POLE,
   which become more pronounced when the exonuclease domains of these
   polymerases are compromised^[140]33. Our analysis revealed that the
   local distribution of these signatures aligns with polymerase δ being
   the active polymerase within 1-kilobase domains centred on human
   origins (Fig. [141]4e). Taken together, these findings support a model
   wherein polymerase δ stalls and triggers error-prone DNA synthesis at
   G4 structures in the immediate vicinity of replication origins of
   cluster 2 tumours.

Mutagenesis at origins in pancreatic adenocarcinoma (PACA) correlates with
markers of increased proliferative drive and replicative stress

   DNA secondary structure-dependent mutagenesis at origins is shared
   among various cancer types, although it is not uniformly observed
   within all samples of the same cancer type. Our next goal was to
   elucidate the specific cancer features that influence the intensity of
   mutagenesis at origins. We selected pancreatic ductal adenocarcinoma
   (PACA) as a model cancer due to its wide range of variation in the
   ratio of mutations at origins to flanks (Fig. [142]1c) and the
   availability of a sufficiently comprehensive matched transcriptomic
   dataset. We first identified distinct cancer subtypes within PACA based
   on transcriptomic profiles. Using a uniform manifold approximation and
   projection dimensionality reduction (UMAP) on normalised gene
   expression counts of individual samples (Fig. [143]5a), we identified
   three subtypes, henceforth referred to as PACA 1 to 3, each exhibiting
   different levels of mutagenesis at origins. While all subtypes show
   similar genome-wide mutation burdens (Supplementary Fig. [144]6a), PACA
   1 exhibits a higher degree of mutagenesis at origins compared to PACA 2
   and 3 (Fig. [145]5b). Notably, tumours classified as PACA 1 were
   characterised by the presence of our cluster 2 origin signature, a
   feature attenuated in PACA 2 and absent in PACA 3 (Supplementary
   Fig. [146]6b), suggesting that DNA structure-dependent mutagenesis at
   origins occurs to varying extents in these cancer subtypes.
   Consistently, exposure to our cluster 2 signature at origins was higher
   in PACA 1 compared to PACA 2 and 3 (Fig. [147]5c). These observations
   indicate that UMAP reduction enables the stratification of pancreatic
   cancers into distinct subtypes, facilitating the study of determinants
   influencing mutagenesis at origins.

Fig. 5. Replicative stress exacerbates origin mutagenesis in pancreatic
ductal adenocarcinoma.

   [148]Fig. 5
   [149]Open in a new tab

   a UMAP dimensionality reduction of transcriptomic profiles from 194
   pancreatic ductal adenocarcinoma samples (PACA project) reveals three
   cancer subtypes denoted PACA 1 to 3. b Distribution of origins/origin
   flanks mutation density ratios for individual cancer samples
   categorised by PACA subtypes. Box plot report medians and interquartile
   ranges. Grey dots represent individual cancer samples. ***P < 0.001,
   Kolmogorov-Smirnov test. c Relative contribution of Cluster 2 signature
   to mutagenesis at origins. d Pathway enrichment analysis of genes
   dysregulated in PACA subtype 1 compared to subtypes 2 and 3. Normalised
   enrichment scores indicate the extent of upregulation of gene sets
   representing KEGG pathways in PACA 1 tumours. Selected dysregulated
   pathways are colour-coded: cancer and cell proliferation (blue),
   carbohydrate metabolism (orange), and DNA repair and replication (red).
   e Gene expression levels of replication stress biomarkers in pancreatic
   cancer^[150]36, categorised by PACA subtypes, represented as Z scores
   computed from the distribution of gene expression values across all
   PACA samples.

   We proceeded with a differential gene expression analysis to identify
   genes displaying altered expression in PACA 1 compared to PACA 2 and 3
   (Supplementary Fig. [151]6c). Subsequently, the results of this
   analysis were used for a gene set enrichment analysis (GSEA) to explore
   the altered KEGG pathways in PACA 1 (Fig. [152]5d and Supplementary
   Fig. [153]6d, details in Methods). The GSEA revealed the upregulation
   of cancer pathways (P = 7.44 × 10^−5, permutation test) linked with
   increased cell growth and proliferation, such as the JAK-STAT or MAPK
   pathways^[154]34,[155]35. This led us to hypothesise that increased
   mutation rates at origins might stem directly from replicative stress
   induced by excessive cell proliferation. To address this hypothesis, we
   evaluated the expression of biomarkers associated with replicative
   stress in pancreatic cancers^[156]36 and discovered that tumours from
   PACA 1 and PACA 2 subtypes indeed exhibit cellular stresses interfering
   with DNA replication (Fig. [157]5e), contrasting with the absence of
   such stresses in PACA 3. However, PACA 1 and PACA 2 tumours displayed
   similar levels of replicative stress biomarkers, suggesting other
   contributing factors. We speculated that these subtypes might diverge
   in their growth profiles. To investigate this further, we inferred the
   duration and respective phases of their cell cycles from the normalised
   expression of the four classes of cyclins (details in Methods). Our
   analysis revealed that PACA 1 and 2 tumours exhibit comparable levels
   of cyclin D1 expression, which is higher than that of the PACA 3
   subtype (Supplementary Fig. [158]6h). Moreover, cyclin D1 expression
   correlated with mutation burden at origins (Rho = 0.404, Supplementary
   Fig. [159]6i) indicating that cell proliferation is associated with
   mutational burden at origins. Additionally, we found that the PACA 1
   subtype is characterised by a longer S-phase and shorter G1 and G2
   phases compared to PACA 2 and 3 tumours (Supplementary Fig. [160]6j).
   Collectively, these observations demonstrate that increased cell
   proliferation and cell cycle dysregulation exacerbate the mutational
   burden at origins.

   Our GSEA analysis also uncovered an association between the
   dysregulation of carbohydrate metabolism pathways and mutagenesis at
   origins (P = 9.84 × 10^−14, permutation test, Supplementary
   Fig. [161]6e). To delve deeper into this observation, we examined the
   specific impact on glycolysis and the pentose phosphate pathway (PPP)
   in the PACA 1 tumour subtype (Supplementary Fig. [162]6f). Both
   pathways are intricately linked and exert significant influence on
   regulating DNA synthesis and repair in cancer^[163]37.
   Glucose-6-phosphate serves as a common substrate for both pathways,
   contributing to the production of pyruvate and ATP in glycolysis or
   ribonucleotides in the PPP. Inhibiting ribonucleotide synthesis is
   expected to affect DNA synthesis and repair^[164]38. Consequently, we
   hypothesised that the stimulation of glycolysis and inhibition of the
   PPP might elevate mutation rates at origins. Indeed, we observed a
   strong association (Supplementary Fig. [165]6f), with notable
   correlations between the expression levels of key enzymes and the ratio
   of mutation at origins versus flanks (Supplementary Fig. [166]6g). For
   instance, both the expression of ATIC (a bifunctional enzyme involved
   in the final two steps of purine biosynthesis) and TALDO1 (an enzyme
   responsible for generating ribose-5-phosphate) exhibited a negative
   correlation with mutational burden at origins (Rho = −0.403 and −0.372
   respectively). The limited availability of comprehensive matched
   mutation and transcriptomic data from other cancers currently limits
   our analysis to PACA. While it will be important to extend these
   observations to cancer types beyond PACA, our findings nonetheless
   suggest an unexpected interplay between metabolic pathways and
   mutagenesis at origins in cancers.

Constitutive origins are hotspots for genome rearrangements

   Our previous GSEA findings (Fig. [167]5d) provide support for the
   downregulation of pathways associated with DNA replication and repair
   as a primary factor influencing the mutation rate at origins in
   pancreatic cancers (P = 3.73 × 10^−9, permutation test, Supplementary
   Fig. [168]6d). To further explore this result, we dissected these
   pathways into smaller components and observed that the PACA 1 tumour
   subtype exhibits downregulation of cell cycle dependent DNA damage
   checkpoints and downregulation of factors involved in processing DNA
   single-strand breaks (Fig. [169]6a). Consequently, breaks originating
   at origins due to polymerase stalling at G4s are more likely to
   progress into S phase or mitosis without undergoing repair, a scenario
   conducive to the emergence of genome rearrangements during cancer
   progression^[170]39. Indeed, early replicating regions of the
   genome^[171]8 and ectopic sites of replication initiation induced by
   oncogene overexpression^[172]4 have been previously shown to be
   hotspots for genome rearrangements. We therefore evaluated the
   prevalence of genome rearrangements by focusing on structural and copy
   number variants, referred to as SVs and CNVs respectively. Our analysis
   revealed that PACA 1 and 2 subtypes are characterised by an increase in
   break ends at origins (Fig. [173]6b) and a higher number of copy number
   segments (Fig. [174]6c) compared to the PACA 3 subtype. These findings
   suggest that origins may serve as hotspots for genome rearrangement in
   pancreatic cancers.

Fig. 6. Constitutive origins serve as focal points for structural and copy
number variants.

   [175]Fig. 6
   [176]Open in a new tab

   a Analysis of DNA repair and replication pathways. Primary KEGG
   pathways were dissected into smaller components to identify specific
   processes downregulated in PACA subtype 1 compared to subtypes 2 and 3.
   b Distribution of structural variant (SV) break ends at origins across
   PACA tumour subtypes. Enrichment of break ends is presented as the fold
   change from background calculated from origin flank values. c
   Distribution of the number of copy number variation (CNV) segments in
   individual PACA samples categorised by PACA tumour subtypes. Box plots
   illustrate medians and interquartile ranges, with grey dots indicating
   individual cancer samples. n. s. non-significant, ***P < 0.001,
   Kolmogorov-Smirnov test. d Pan-cancer signatures of structural
   variation based on break ends mapped at origins or origin flanks,
   categorised by SV types and lengths. SV types are: INS (insertion), DEL
   (deletion), DUP (duplication), INV (inversion) and ITX (translocation).
   e Distribution of SV break ends at 5 nt resolution at G-quadruplex
   forming sequences within origin or origin flank domains. f Signatures
   of copy number variation for PACA tumour subtypes, classified by CNV
   types, copy number, and evidence of loss of heterozygosity (LOH) or not
   (Het). Each CNV type is further classified by increasing segment sizes
   (from left to right). Segment sizes were excluded from this plot to
   enhance clarity. Detailed information on segment sizes can be found in
   the Methods section. g Enrichment of origins within amplified CNV
   segments in PACA subtypes. Enrichment values were evaluated by
   partitioning CNV segments and their flanking domains (±2 Mb) into an
   equal number of windows, calculating the number of origins per segment
   and per window, and then determining the mean number of origins per
   window. This value was then adjusted by subtracting the mean values
   observed in segment flanking domains. The resulting origin enrichment
   value indicates the excess of origins within a given window.

   Due to the limited number of SVs observed at origins in PACA cancers,
   we expanded our investigation to characterise structural variation at
   origins across various cancer types. Our analysis confirmed an increase
   in break ends at origins pan-cancer (Supplementary Fig. [177]7a) and we
   proceeded with establishing the signatures of structural variation at
   origins and origin flanks using a standardised classification
   procedure^[178]40. This approach revealed an enrichment of short tandem
   duplications at origins compared to their flanks (Fig. [179]6d). These
   enriched duplication events occur to the detriment of other SVs
   (Supplementary Fig. [180]7b), exhibit a median length of ~100 base
   pairs (Supplementary Fig. [181]7c) and stem from break ends occurring
   predominantly at Gs (Supplementary Fig. [182]7d). This latter
   observation underscores the potential role of breaks originating from
   polymerase stalling at G4s as the primary driver of these events. To
   validate this hypothesis, we examined the distribution of SV break ends
   in the vicinity of G4s (Fig. [183]6e) and found an enrichment of break
   ends within 5 bp of the start of G4 motifs at origins but not at G4s
   within origin flanks. Taken together, these findings indicate that
   breaks arising from replication initiation are transformed into a
   diverse array of short tandem duplications, with the potential to alter
   genome structure and function.

   Finally, we characterised the copy number segments mapped within PACA
   subtype genomes. Notably, we observed that the increase in segment
   numbers in PACA 1 and 2 (Fig. [184]6c) coincides with the presence of
   shorter segments with median lengths of ~2 Mb (Supplementary
   Fig. [185]7e), suggesting a potential association between origins and
   focal copy number alterations. Analysing the pattern of copy number
   alterations in PACA subtypes, focusing on length, absolute copy number,
   and potential loss of heterozygosity (LOH) following an established
   procedure^[186]41, yielded distinct signatures for PACA 1 and 2 when
   compared to PACA 3 (Fig. [187]6f). Firstly, PACA 1 and 2 subtypes are
   characterised by the presence of high copy number segments (3 and
   above) that underwent LOH, indicative of chromosomal duplication and
   segmental aneuploidy^[188]41, possibly resulting directly from the
   replicative stress experienced by these subtypes. More interestingly,
   these subtypes also exhibited an increase in short segments (1 to 3 Mb)
   amplified to 3 – 5 copy numbers without signs of LOH. To explore a
   potential role for constitutive origins in generating these segments,
   we mapped the distribution of origins within PACA 1 and 2 amplified
   segments with copy numbers higher or equal to 3 and found that origins
   marked their boundaries (Fig. [189]6g). A permutation test confirmed
   the statistically significant enrichment of origins at the boundaries
   of these segments (P ≤ 0.001 over 1000 permutations, Supplementary
   Fig. [190]7f). Importantly, such enrichment was not observed for
   amplified segments from the PACA 3 subtypes (Fig. [191]6g) or for loss
   segments across any PACA subtypes (Supplementary Fig. [192]7g). To
   better understand the mechanism by which breaks induced at origins
   drive copy number alteration, we noted that origin enrichment at
   segment boundaries was confined to amplified segments of 3 or 4 copy
   numbers (Supplementary Fig. [193]7h) showing no signs of LOH
   (Supplementary Fig. [194]7i). Collectively, these observations support
   and extend the notion that replication origins are hotspots for copy
   number alteration and suggest that breaks originating from replication
   initiation drive focal rearrangements through a specific mechanism
   discussed below.

Discussion

   Our analysis identifies and characterises biological processes
   associated with replication initiation that focus mutagenesis and
   genome rearrangements at thousands of constitutive origins. Notably,
   while the origin domains we examined cover <0.3% of the human genome,
   they can contribute up to 2.7% of the mutations in some cluster 2
   cancer samples. Considering that constitutive origins are enriched
   within functional regions of the human genome^[195]7, such as promoters
   and exon/intron junctions, we anticipate that this focused mutagenesis
   contributes to the evolution and expression of cancer genomes.
   Furthermore, we found that origin activity leads to the emergence of
   numerous tandem duplication events and copy number alterations with
   potential functional consequences. For instance, we identified the
   duplication of short origin sequences within the first introns of the
   MYC and TYK2 proto-oncogenes. These events have the potential to alter
   the expression of these genes, as origin sequences are enriched for
   specific transcription factor binding sites and cis-regulatory
   elements^[196]7,[197]42. We also mapped the amplification of megabase
   segments, marked with constitutive origins at their boundaries,
   containing pancreatic cancer associated oncogenes such as EIF3B or MTA1
   that may drive specific cancer phenotypes^[198]43,[199]44.

   While the mechanisms underlying some of the mutational processes, we
   identified are clear, further investigations are necessary to
   understand the molecular basis for origin-associated genome
   rearrangements. Although we observed distinctive signatures in SVs and
   CNVs at origins, the mechanistic basis of these events remains
   speculative. We have previously demonstrated that breaks are enriched
   at constitutive origins^[200]7. Our data supported that such breaks
   were associated with enrichment of factors associated with
   non-homologous end joining, which may be expected to favour the
   observed deletions more than sequence duplications and amplification.
   Thus, such breaks would not likely explain the association of
   constitutive origins and CNVs. However, double-stranded DNA breaks with
   protruding ends, such as those that may arise following polymerase
   stalling at G4s near origins, are known to trigger small tandem
   duplications following overhang gap filling and end-joining (reviewed
   in ref. ^[201]45). Indeed, secondary structures, particularly G4s, have
   been associated with CNV boundaries in both cancer^[202]24 and the
   germline^[203]46. In this context, the size of tandem duplication
   events would be controlled by the location of the breaks and the site
   where DNA synthesis is initiated. Intriguingly, tandem duplications are
   confined within the 1-kilobase domains defined by replication origins,
   leading to duplication of origin sequences. Due to the absence of LOH
   in the resulting amplified segments, we can only speculate that they
   may result from a mechanism such as fork stalling and template
   switching (FoSTeS)^[204]47. In this mechanism, the decoupling between
   leading and lagging strand synthesis, caused by polymerase stalling at
   G4s, would trigger a switch to a region with microhomology, such as
   another replication origin or G4-rich domains, to restart DNA
   synthesis. Such events could occur multiple times before the
   replication fork returns to its original position to continue
   replication to the chromosome end, favouring duplications over
   deletions and resulting in the amplification without LOH observed in
   our analysis.

   Our findings are based on the comprehensive analysis of mutagenesis
   occurring at ‘constitutive’ origins that are conserved across several
   human cell types^[205]6. These origins are recurrent sites for
   replication initiation and accumulate mutations at each cell division.
   However, it is important to acknowledge the potential variability in
   the activity and efficiency of these origins across different cancer
   types. Indeed, certain cancers may inactivate some origins or exhibit
   variations in their efficiency compared to cultured cells.
   Additionally, the induction of oncogenes can activate dormant origins,
   further complicating the estimation of global mutation rates at these
   sites^[206]4. Consequently, our estimations may be subject to some
   degree to both underestimation and overestimation. Nevertheless, our
   study reveals compelling associations across pan-cancer and
   tissue-specific analyses, strongly advocating for replication
   initiation playing a critical role in shaping and focussing specific
   mutational processes at origin sites. Notably, we demonstrate that the
   mutagenic potential of replication origins is exacerbated by
   replicative stress associated with tumour progression. Furthermore, we
   show that the intrinsic biology of cancer subtypes significantly
   influences the intensity of mutagenesis observed at these sites. These
   collective observations underscore the integral role of replication
   origins as fundamental genetic elements regulating the mutagenesis that
   drives the development and progression of human cancers.

Methods

Cancer mutation and associated datasets

   Somatic mutation data for cancer, including single nucleotide variants
   (SNVs, generally referred to as mutations in the manuscript),
   structural variants (SVs), and copy number variants (CNVs), were
   sourced from the International Cancer Genome Consortium (ICGC, release
   28). SNVs and SVs for individual cancer samples were derived from the
   Broad variant calling pipeline, while CNVs for pancreatic ductal
   adenocarcinoma samples were obtained from the DKFZ/EMBL variant calling
   pipeline. These mutations were directly used for analysis. Only cancer
   samples that were whole-genome sequenced and had >5000 mutations were
   considered for computation of SNV mutation rates, or 50 mutations at
   origin domains for single base substitution mutation signature
   analysis. The minor allele frequency was calculated in bins of 100
   nucleotides as the number of occurrences of a given mutation divided by
   the total number of screened cancer samples. Transcriptomic profiles
   were generated from sequencing-based gene expression data of ICGC
   tumours, focusing only on protein-coding genes. Gene expression levels
   were quantified as transcripts per million (TPM) using raw read counts
   and exon lengths associated with all transcripts mapped to a given
   gene. Exon lengths were retrieved from the
   TxDb.Hsapiens.UCSC.hg19.knownGene annotation R/Bioconductor package.
   Throughout the manuscript we refer to cancer types using the ICGC
   cancer project nomenclature and their associated acronyms; BOCA: Bone
   cancer, BRCA: Breast cancer, BTCA: Biliary tract cancer, CLLE: Chronic
   lymphocytic leukaemia, CMDI: Chronic myeloid disorders, EOPC: Early
   onset prostate cancer, ESAD: Oesophageal adenocarcinoma, GACA: Gastric
   cancer, LAML: Leukaemia, LICA: Liver cancer, LINC: Liver cancer, LIRI:
   Liver cancer, MALY: B cell malignant lymphoma, MELA: Skin cancer, ORCA:
   Oral cancer, OV: Ovarian cancer, PACA: Pancreatic cancer, PAEN:
   Pancreatic endocrine neoplasms, PBCA: Paediatric brain cancer, PRAD:
   Prostate Adenocarcinoma and RECA: Renal cell cancer.

Somatic mutation datasets

   Somatic mutation data from non-cancerous tissues were obtained from
   Abascal et al.^[207]13 and Moore et al.^[208]14 Abascal et al. utilised
   nanorate sequencing (NanoSeq), a duplex sequencing technique with
   extremely low error rates, to quantify somatic mutations in
   non-dividing cells across various tissues, including cord blood
   granulocytes, smooth cardiac muscle, post-mitotic neurons, colonic
   crypts, and sperm cells. To assess mutational burden at origins, we
   combined all identified mutations and aligned the signal with
   constitutive origins. Moore et al. provided data on somatic mutations
   from multiple cell types in three individuals. These somatic mutations
   were identified by comparing mutations in 29 anatomical structures from
   9 organs to germline mutations, and these data were directly used for
   analysis.

Constitutive origin dataset

   The constitutive origins analysed in our study were identified using
   our high-resolution origin mapping technique, initiation site
   sequencing 2 (ini-seq 2), in the EJ30 bladder carcinoma cell
   line^[209]6. These origin coordinates, available from the Gene
   Expression Omnibus archive with the accession number [210]GSE186675,
   initially mapped in the hg38 human genome assembly, were converted to
   the hg19 assembly using the liftOver tool from the rtracklayer
   R/Bioconductor package^[211]48, with chain files obtained from
   [212]http://hgdownload.cse.ucsc.edu. Previous research has shown
   significant overlap between these origins and those identified by short
   nascent strand isolation coupled with next-generation sequencing
   (SNS-seq) in the same cell line, as well as with a database of ‘core’
   origins identified by SNS-seq in 19 other human cell samples^[213]12.
   To ensure accurate comparison of mutation rates, we considered 10 kb
   flanking regions on either side of origin midpoints. Additionally, to
   avoid potential biases from replication origin clusters, we focused
   only on domains containing a single origin resulting in analysis of
   9,341 constitutive origins, refer to as ‘isolated’ origins. The
   coordinates of origins considered in this work is reported in
   Supplementary Data [214]1. To evaluate similarities in terms of SNV
   count between ‘isolated’ origins and clustered origins, we define
   origin clusters as genomic intervals >20 kb that contain at least two
   constitutive origins. On average, origin clusters contain ~7 origins.
   To compare mutation rates at constitutive and core origins, we define
   ‘core’ origins, in this article, the cell type-independent origins
   identified by SNS-seq^[215]12 that do not overlap with those mapped
   using ini-seq 2. We consider two initiation sites to represent the same
   origin if the mapped domains are located within <5 kb of each other.

Defining genomic regions and mutation densities

   In our analysis, we contrast mutation rates and mutational mechanisms
   at origin domains with those at their flanking domains. We define
   origin domains as 1 kb regions centred on origin midpoints, and origin
   flanking domains as 20 kb regions also centred on origin midpoints but
   excluding the previously defined origin domains. We calculated mutation
   rates near origins using 100 bp windows, counting the number of
   relevant mutations and adjusting for the number of covered bases. To
   assess mutational burden at origins, we computed ratios of the
   length-normalised mutation counts at origin domains relative to those
   in flanking regions.

Single base substitution mutational signature analysis

   Evaluation and visualisation of substitution mutational patterns at
   constitutive origins were performed using the MutationalPatterns
   R/Bioconductor package^[216]49. We defined mutation count matrices by
   considering the frequencies of each of the six pyrimidine
   substitutions, calculated in each of the possible 96 trinucleotide 5′
   to 3′ contexts. To compute background-adjusted signatures at origins,
   we adjusted mutation count matrices obtained from origin domains based
   on their trinucleotide composition and to similar mutation count
   matrices obtained from origin flanking domains. Tumour sample
   clustering was performed using these adjusted signatures to
   differentiate samples according to mutation processes and cosine
   correlation similarities were used to quantify the closeness of
   origin-associated signatures. Genome-wide signatures were not corrected
   for local variation in trinucleotide composition. Origin-associated
   mutational signatures were compared to known signatures of somatic
   mutations in cancerous human tissues collected by the Catalogue of
   Somatic Mutations in Cancer (COSMIC, release v3.2)^[217]15 and
   available from [218]https://cancer.sanger.ac.uk/signatures/sbs.
   Exposures to both newly reported and COSMIC signatures were computed in
   100 bp windows using the fit_to_signatures and fit_to_signatures_strict
   function of the MutationalPatterns package respectively. For COSMIC
   signature refitting experiments, only signatures contributing to at
   least 50 mutations within at least a window were considered.

Nucleotide excision repair profiles at constitutive origins

   Genome-wide NER sequencing (XR-seq) datasets of cyclobutane pyrimidine
   dimers (CPD) and pyrimidine-pyrimidone (6–4) photoproducts (6–4 PP)
   from ultraviolet-irradiated CSB/ERCC6 mutant NHF1 skin
   fibroblasts^[219]18, were retrieved from Gene Expression Omnibus
   ([220]GSE67941). To quantify repair against DNase I hypersensitivity
   (DHS) at origins, ENCODE consortium data for the GM04504 skin
   fibroblast cell line was used. Sequencing data were formatted into
   bigwig files to report read counts at 50 nt resolution. These files
   were used to compute coverage matrices and generate heatmaps using the
   EnrichedHeatmap R/Bioconductor package^[221]50. Correlations between
   mutation counts for cluster 1 tumours, XR-seq, and DHS signals at
   origins were computed based on averaged signals over origin domains
   (origin midpoints ± 500 bp). NER strand bias around origins was
   evaluated by analysing strand-resolved XR-seq signals in 100 bp
   windows, corrected for dinucleotide base composition. As reported, the
   MC-062 and 64M-2 antibodies were used for CPD and 6–4 PP XR-seq data
   generation, respectively, with MC-062 targeting thymidine dimers and
   64M-2 targeting UV-C irradiated DNA, in which 6–4 PP adducts are
   generated at CC, CT, and TC dinucleotides^[222]51. Thus CPD and 6–4 PP
   XR-seq data were adjusted for the frequencies of these dinucleotides.
   To confirm that locally reduced NER activity persists at the origins
   when controlling for transcription factor binding, we used publicly
   available ChIP-seq datasets for CTCF, ETS1, REST, EZH2, and NANOG from
   ENCODE
   ([223]http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encRegTfbsClustere
   d/) in the GM23338 cell line.

G-quadruplex forming propensity

   To assess the likelihood of a sequence folding into a G-quadruplex (G4)
   structure and to quantify the predicted stability of the resulting
   motif, we employed the G4H score, which is derived from the G4 Hunter
   algorithm^[224]25. This algorithm takes into account the richness and
   skewness of guanine (G) content within a sequence to compute its G4H
   score. Each position within a sequence receives a score ranging from −4
   to 4. G-skewness is considered by assigning a neutral score (0) to
   adenine (A) and thymine (T), a positive score to guanine (G) (as it
   promotes G4 formation), and a negative score to cytosine (C) (as it
   promotes hairpin and impedes G4 formation). G-richness is accounted for
   by assigning a score equivalent to the length of G tracts for each G,
   with a maximum score of 4. Similarly, cytosines in C tracts receive
   negative scores. The G4H score for a sequence is the average of these
   individual scores. Thus, the propensity of a sequence to fold into a G4
   motif correlates with its G4H score, with higher scores indicating G4
   formation on the plus strand and lower scores on the minus strand. To
   investigate G4 formation at specific positions, we extracted sequence
   contexts ( ± 25 bp) and computed associated G4H scores. To identify
   potential G4-forming sequences, we searched for sequences conforming to
   the regular expression pattern
   N[5]G[3+]N[1–12]G[3+]N[1–12]G[3+]N[1–12]G[3+]N[5] (where N represents
   any base), which represents a slight modification from the known
   quadparser algorithm^[225]26, allowing for extended loop lengths.
   Additionally, we expanded the core motif for G4 formation by including
   five flanking bases to account for the G4 context when predicting
   stability using the G4 Hunter algorithm. Experimentally validated G4
   structures within cell chromatin were identified from G4access peaks.
   G4access is an antibody- and crosslinking-independent method used to
   isolate and sequence G4s associated with open chromatin through
   nuclease digestion^[226]27. In our analysis, following the approach of
   the original publication, we considered G4-forming sequences as the 25
   nt sequences with the highest G4H score within G4access peaks from
   HaCaT, K562, and Raji cells.

Differential gene expression and gene set enrichment analysis

   Differential gene expression analysis was performed using the DESeq2
   R/Bioconductor package^[227]52, with pancreatic ductal adenocarcinoma
   (PACA) samples grouped based on cancer subtypes identified through UMAP
   dimensionality reduction of their transcriptomic profiles. UMAP
   analysis was performed using the umap R package^[228]53 from TPM values
   of protein coding gene expression using two components and starting
   from 20 random states. Changes in gene expression were computed by
   contrasting PACA subtype 1 with PACA subtypes 2 and 3 independently.
   Differentially expressed genes were ranked using the DESeq Wald
   statistic and subjected to gene set enrichment analyses (GSEA) using
   the fgsea R/Bioconductor package^[229]54 and curated KEGG pathways.
   Hallmark gene sets sourced from the Molecular Signatures Database
   (MSigDB, version 7.1)^[230]55 were used, available as R lists in RDS
   format at [231]https://bioinf.wehi.edu.au/MSigDB/. Pathway enrichments
   were determined by comparing PACA subtype 1 with PACA subtypes 2 and 3
   individually, followed by averaging normalised enrichment scores and
   combining P-values using Fisher’s method. Pathways of interest were
   further dissected into smaller components using information from the
   Reactome pathway database ([232]https://reactome.org/)^[233]56. Once
   more, normalised enrichment scores and associated P-values were
   combined from independent analyses comparing PACA subtype 1 with PACA
   subtypes 2 and 3. Consequently, our analysis delineates pathways that
   are specifically dysregulated in PACA subtype 1, which exhibits a
   higher mutational burden at origins.

Replicative stress and cell cycle evaluation in pancreatic ductal
adenocarcinoma

   Replicative stress within PACA samples was evaluated from the gene
   expression profiles of established biomarkers^[234]36. Expression
   levels of these biomarkers were computed as Z scores relative to the
   distribution of raw count values across all PACA samples. Analysis of
   cell proliferation and cell cycle dysregulation in PACA samples was
   inferred from the expression patterns of the four cyclin classes:
   cyclin A (CCNA1 and CCNA2), cyclin B (CCNB1 and CCNB2), cyclin D
   (CCND1, CCND2, and CCND3), and cyclin E (CCNE1 and CCNE2) in
   transcripts per million (TPM). These expression profiles were used to
   estimate the cell population proportions in G1/S phase (calculated as
   the ratio of cyclin E expression to the sum expression of all cyclins),
   S/G2 phase (calculated as the ratio of cyclin A expression to the sum
   expression of all cyclins), and G2/M phase (calculated as the ratio of
   cyclin B expression to the sum expression of all cyclins)^[235]57.

Signature of structural variation

   Pan-cancer structural variant (SV) breakpoints were mapped to either
   origin or origin-flanking domains, and SV signatures within these
   domains were extracted following a standardised protocol^[236]40. SVs
   were annotated as INS (insertion), DEL (deletion), DUP (duplication),
   INV (inversion), and ITX (translocation) based on the orientation and
   position of the genomic location covered by the supporting reads, using
   a custom R script. Additionally, SVs were further classified based on
   their length into eight predefined intervals: 0–0.1 kb, 0.1–0.5 kb,
   0.5–1 kb, 1–5 kb, 5-10 kb, 10–100 kb, 0.1–1 Mb and >1 Mb. Finally, the
   frequency of occurrence for each SV event was computed.

Signature of copy number alteration in pancreatic ductal adenocarcinoma and
relationship with constitutive origins

   To identify signatures of copy number alteration in PACA subtypes, we
   adapted an established procedure^[237]40 with slight modifications to
   emphasise focal amplification of small segments. We combined segments
   from PACA subtypes 1 and 2 as they did not show significant differences
   in a preliminary analysis. Annotations from the ICGC data describing
   mutation types were used to evaluate loss of heterozygosity (LOH) or
   heterozygosity (Het) within CNV segments. Subsequently, segments were
   categorised based on their absolute copy number and length into seven
   predefined intervals: 0–0.1 Mb, 0.1–1 Mb, 1–2 Mb, 2–3 Mb, 3–5 Mb,
   5–10 Mb, and >10 Mb. The frequency of occurrence for each copy number
   variation (CNV) event was determined. We investigated the association
   between constitutive origins and copy number alterations by examining
   the distributions of origins within copy number segments. We first
   computed the number of origins per 100 kb window across the human
   genome, generating a bigwig file for subsequent coverage calculation at
   CNV segments. We then evaluated origin distribution by dividing CNV
   segments and their adjacent domains (±2 Mb) into 100 equally sized
   windows and determining origin coverage per segment and per window
   using the normalizeToMatrix function of the EnrichedHeatmap
   R/Bioconductor package^[238]50. Enrichment values were then computed by
   averaging origin counts across windows within segments of interest.
   These values were adjusted for background by subtracting the mean
   counts observed in flanking domain segments. Consequently, the reported
   enrichment values represent the excess of origins compared to
   background levels. CNV segments with length >100 kb and <5 Mb were
   considered for these analyses. To assess origin enrichment at the
   boundaries of amplified segments in PACA 1 and 2 cancer subtypes, we
   identified segment breakpoints as the start or end of mapped amplified
   segments. We evaluated the significance of overlap with origin domains
   using the overlapPermTest function of the regioneR R/Bioconductor
   package^[239]58, conducting 1,000 permutations.

Computational and statistical analyses

   Analysis and all statistical calculations were performed in R (version
   4.0.3). R packages: MutationalPatterns (3.15.0), DESeq2 (1.45.0),
   StructuralVariantAnnotation (1.22.0), dendextend (1.17.1), umap
   (0.2.10.0), EnrichedHeatmap (1.35.0), regioneR (1.38.0), rtracklayer
   (1.65.0), biomaRt (2.61.0), BSgenome.Hsapiens.UCSC.hg19 (1.4.3),
   org.Hs.eg.db (3.19.1), TxDb.Hsapiens.UCSC.hg19.knownGene (3.2.2),
   tidyverse (2.0.0), tidyr (1.3.1), dplyr (1.1.4), tibble (3.2.1),
   stringr (1.5.1), readxl (1.4.3), rlist (0.4.6.2), ggplot2 (3.5.1),
   ggpubr (0.6.0), scales (1.3.0), plotrix (3.8-4), forcats (1.0.0),
   gplots (3.1.3.1), wesanderson (0.3.7), RColorBrewer (1.1-3).

Reporting summary

   Further information on research design is available in the [240]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [241]Supplementary Information^ (4.8MB, pdf)
   [242]41467_2024_55148_MOESM2_ESM.pdf^ (81.1KB, pdf)

   Description of Additional Supplementary Files
   [243]Supplementary Data 1^ (379.2KB, txt)
   [244]Reporting Summary^ (1.4MB, pdf)
   [245]Transparent Peer Review file^ (1.7MB, pdf)

Acknowledgements