Abstract Metastasis remains a major cause of cancer mortality. This study, expanding upon previous findings in the MMTV-PyMT model, investigated four independent mouse models, representing luminal (MMTV-PyMT, MMTV-Myc), HER2-amplified (MMTV-Her2), and triple-negative (C3(1)TAg) breast cancer subtypes. Consistent with previous results, limited evidence for metastasis-associated somatic point mutations was found for all models. We also found that oncogenic drivers significantly influenced the number and size of metastasis-specific copy number variations (MSCNVs), but common driver-independent MSCNVs were rare. Furthermore, analyzing a cohort with varying genetic backgrounds while maintaining a constant oncogenic driver (PyMT) revealed that genetic background profoundly impacts MSCNVs. Transcriptome analysis demonstrated that oncogenic drivers strongly shaped metastasis-specific gene expression (MSGE), with each driver exhibiting distinct expression profiles. In contrast, MSGE in the PyMT-F1 cohort was more variable across strains. Despite the diversity of MSCNV and MSGE, functional analysis revealed that both mechanisms converge on the modulation of key cellular processes, including immune responses, metabolism, and extracellular matrix interactions. These findings emphasize the complex interplay between oncogenic drivers and genetic background in shaping the genomic and transcriptional landscapes of metastatic lesions. Subject terms: Cancer genomics, Cancer genomics __________________________________________________________________ Genetic background and oncogenic driver determines the genomic evolution and transcriptomics of mammary tumor metastasis. Introduction In the United States, breast cancer is the most frequently diagnosed cancer type and accounts for the third highest number of cancer-associated deaths annually. While patients with localized disease have a 5-year survival rate of 99.6% following tumor resection, the 5-year survival rate drops to 31.9% for patients with distant metastasis^[38]1. This disparity stems from the inability to effectively treat secondary lesions through either surgical resection or current therapeutic strategies. Metastatic tumors reside in microenvironments and exhibit therapeutic resistances that are distinct from those of the primary tumors, warranting further research and development to create new and more specific therapeutic strategies. Unfortunately, despite many recent advances, the etiology and mechanisms driving metastasis remain unclear. The conventional model of metastatic progression was first proposed by Peter Nowell in 1976, who hypothesized that tumors arise from a single mutated cell and continue to acquire additional mutations that persist if they enable survival and growth^[39]2. In this model, clones continue to evolve within the primary tumor (PT), ultimately developing into subclones with metastatic ability. Recent increased access to DNA-sequencing technologies has enabled several clinical studies to characterize the genomic landscape of metastatic tumors^[40]3–[41]5. However, although much of the data obtained to date is consistent with this hypothesis, no metastasis-driver mutations or acquired genomic aberrations have been identified in any cancer type. Indeed, current data suggest the acquisition of transcriptomic and epigenetic switching mechanisms, rather than mutations, drive dissemination and secondary tumor growth^[42]6–[43]9. To better understand the etiology of breast cancer metastasis, well-credentialed, physiologically relevant, and accessible experimental systems are required. Clinical data, while extraordinarily valuable, is difficult and expensive to obtain in large numbers, is experimentally limited, and is usually confounded by treatment, which may significantly influence data interpretation. Mouse models of metastatic cancer, therefore, provide a complementary experimental vehicle to assess the etiology of metastasis in a treatment-naïve, highly controlled setting. Indeed, several genetically engineered mouse models (GEMMs) have been developed to mimic the molecular subtypes of metastatic breast cancer found in patients. Our laboratory and many others have published extensively using the mammary-specific polyomavirus middle T antigen overexpression mouse model (MMTV-PyMT) to model the luminal B subtype of human breast cancers^[44]6,[45]10–[46]12. This model was developed in 1992 and faithfully recapitulates human breast cancer progression and metastasis to the lung^[47]13,[48]14. Previously, we performed genomic and transcriptomic analysis of the MMTV-PyMT model. By performing whole exome-sequencing (exome-seq) on genomic DNA (gDNA) from matched PT and lung metastatic (lung met) tissues, we showed that very few single-nucleotide variants (SNVs) occur in metastases at a high frequency that are not also found in the primary tumor and concluded that SNVs are an unlikely driver of metastasis^[49]6. In addition, we observed no obvious metastasis-associated translocations, but some metastasis-associated copy number variation (CNV). In this study, we expand upon this work to include both the MMTV-Myc and the C3(1)TAg models, as well as provide greater depth to our prior analysis of the MMTV-Her2 and PyMT models. In addition, we have further investigated how an individual’s personal ancestry might influence metastasis-associated genomic evolution in the MMTV-PyMT model by analyzing F1 animals generated from crosses between phylogenetically related mouse strains that exhibit significant differences in metastatic efficiency. Consistent with previous analyses, we confirm that SNVs with a metastasis driver function are not present in most of the animals analyzed, regardless of oncogenic driver or genetic background. However, we have identified genetic background as a strong determinant of PT stability, copy number region size, and gene-set enrichment within regions of gDNA gain or loss. Further, we show that gene expression at end-stage is not strongly correlated with CNV in our models. Finally, our work reveals no common gene or gene-set found within CNV, but rather cellular processes enriched in metastasis-specific gene expression and alternative splicing. Results Single nucleotide variants are unlikely drivers of metastasis To evaluate the relative contributions of oncogenic driver and genetic background to the genomic landscape of primary tumors and lung metastases, we developed two mouse model cohorts (Fig. [50]1a). The first cohort, the “oncogenic driver” cohort, utilized four distinct mouse models of metastatic breast cancer, each with a unique oncogenic driver but identical FVB genetic background. The mouse mammary tumor virus (MMTV)-neu (Her2) line was used to model Her2-amplified (Her2+) breast cancer, a subtype representing approximately 17% of breast cancer patients^[51]15,[52]16. The MMTV-Myc (Myc) line was used to model metastatic Myc-driven breast cancer, which is observed in approximately 30–50% of high-grade tumors^[53]17. The C3(1)/TAg-REAR (C3Tag) line was used to model a metastatic basal-like subtype, which is seen in 8% of breast cancer patients^[54]18,[55]19. Finally, the N-Tg(MMTV-PyVT)634Mul/J (PyMT) line^[56]20, which forms tumors that are highly metastatic, was used to model activated tyrosine kinase-driven cancers, which represent 49% of Luminal B, 32% of Luminal A, 42% of Her+, and 7% of Basal-like tumors, or 40% of all breast cancers^[57]21. Thus, the “oncogenic driver” cohort was comprised of four groups, PyMT, Her2, C3Tag, and Myc, with the FVB genetic background held constant. For the second cohort, or “PyMT-F1” cohort, FVB-PyMT males were bred to females from seven different genetic backgrounds and tumor-bearing PyMT-F1 offspring were evaluated. The genetic backgrounds chosen constitute two pairs (MOLF and CAST, BL6, and BL10) and one group of three (AKR, BALB, SEAGN) closely related strains with inverse metastatic efficiency (high (H) or low (L)) (Fig. [58]1a and Supplementary data [59]1)^[60]12, permitting us to evaluate the contribution of genetic background on the genomic landscape with oncogenic driver held constant. We isolated genomic DNA (gDNA) from matched pairs of primary tumors (PT) and lung metastases (Lung mets) from both cohorts (Supplementary data [61]1) and performed exome-sequencing (exome-seq) (Fig. [62]1a). Fig. 1. Model and genetic background shape MSCNV. [63]Fig. 1 [64]Open in a new tab a Schematic overview of mouse models, tissues collected, and analyses performed in this study. Histograms showing b the average percentage of the genome and c the average number of regions with CNV with standard deviation (SD), PT (blue), and lung mets (red). d Dot plot of individual CNV region lengths from aggregate data (the average indicated by black line). e Multivariate bubble plot showing average CNV number and aggregate total number of genes with CNV for PT (squares) and lung met (circles) from each model, with bubble size corresponding to CNV region size. Total number (black) of regions (f) and genes (g) with significant CNV gain (dark purple) and loss (light purple) in MSCNV. h Dot plot showing individual CNV region lengths from MSCNV (the average indicated by red line). i Multivariate bubble plot showing average CNV number and aggregate total number of genes with CNV for MSCNV (circles) from each model, with bubble size corresponding to CNV region size. j Histogram showing the average % of the genome and k average number of regions with CNV with SD, PT (blue), and lung mets (red). l Dot plot showing individual CNV region lengths from aggregate data with all animals per model combined (the average indicated by a line). m Multivariate bubble plot showing average CNV number and aggregate total number of genes with CNV for PT (squares) and lung met (circles) from each model, with bubble size corresponding to CNV region size. n Total number (black) of regions and o genes with significant CNV gain (dark purple) and loss (light purple) in Lung mets compared to PT. p Dot plot showing individual CNV region lengths from MSCNV (the average indicated by red line). q Multivariate bubble plot showing average CNV number and aggregate total number of genes with CNV for MSCNV (circles) from each model, with bubble size corresponding to CNV region size. Significance determined by Kruskal–Wallis test (d, h, l, p) or unpaired t test (b, c, j, k). *(p < 0.05), **(p < 0.01), ***(p < 0.001) ****(p < 0.0001), $ = significantly different from 6 groups, # = significantly different from 5 groups, ø = significantly different from 4 groups. Fig. 1a was generated using BioRender. We have previously shown that high allele frequency, metastasis-specific SNVs were not found at consequential rates in a smaller mouse cohort^[65]6. Using the exome-seq data from this larger cohort described above, we expanded the number of individuals used for SNV analysis. Consistent with our previous data, high frequency, metastasis-specific SNVs were not observed, suggesting that common metastasis driver mutations, analogous to common tumor driver mutations, are not major drivers of tumor progression (Supplementary fig. [66]1). However, as previously demonstrated^[67]6, low frequency metastasis-specific SNVs were observed, suggesting that either there are many low frequency metastasis driver SNVs or subsets of patients in which metastasis-specific SNVs contribute to tumor progression. However, due to the low frequency of the potential metastasis driver SNVs, subsequent analysis focused on copy number variation (CNV) induced by regions of genomic gain or loss to determine whether genome rearrangement functioned as more frequent metastasis driving events. Herein, through use of Nexus Copy Number Software 10.0, we have evaluated CNV differences across sites (PT or Lung mets), both within and between cohorts. Oncogenic driver has little influence over CNV event number We first evaluated the oncogenic driver cohort, visualizing PT and Lung met data separately for each FVB model, and calculated the average percent of the genome with CNV (Supplementary data [68]2). We observed no significant difference in the percentage of the genome with DNA gain or loss between PT and Lung mets, and no difference between oncogenic driver models (Fig. [69]1b). The average percentage of the genome with CNV ranged between 43–58% in PTs and 46–53% in Lung mets (Fig. [70]1b and Supplementary fig. [71]2). Similarly, the average number of regions with CNV did not differ significantly between tissue type or model (Fig. [72]1c and Supplementary data [73]2). To determine whether the high degree of CNV observed was the result of whole genome duplication, the samples were reanalyzed using the ABSOLUTE algorithm^[74]22. Unlike human tumors^[75]23, no evidence whole genome duplication was observed in the mouse samples. Despite having a similar degree of CNV overall, we observed key differences in the size of DNA regions with amplification or deletion. While no significant differences were noted in the CNV length for the Myc and Her2 models, CNV regions in the PTs were shorter in PyMT and longer in C3Tag compared to their matched Lung mets (Fig. [76]1d and Supplementary data [77]3). When analyzing regions of gain or loss separately, we found that regions of DNA gain were significantly larger (3572–4057 kb) than those of loss (806–1340 kb) in PTs from three models (Her2, Myc, and PyMT), and in Lung mets from all models (2497–3483 kb gains, 1370–2330 kb losses) (Supplementary fig. [78]3a, b). However, despite these slight but consistent differences in CNV gain and loss length, the four FVB oncogenic driver models maintained a similar number of regions and genes impacted by CNV (Fig. [79]1e). Oncogenic driver determines metastasis-specific CNV event number and size We next focused on metastasis-specific CNV (MSCNV) for each model, or regions altered in Lung mets but not PTs. Interestingly, oncogenic driver played a significant role in the number of MSCNV regions and the number of genes within them (Supplementary data [80]4). The Myc driven model had the highest number of MSCNV regions (10,576) impacting the highest number of genes (16,134), followed by PyMT (5198 regions, 10,823 genes), Her2 (3003 regions, 9213 genes), and C3Tag (818 regions, 3007 genes) (Fig. [81]1f, g and Supplementary data [82]4, [83]5). Regions of MSCNV were even shorter in length than the average Lung met CNV by around 10-fold, 200 kb compared to 2000 kb, respectively (Fig. [84]1h and Supplementary fig. [85]3c). Overall, there were small but significant differences in the average length of MSCNV between models (Fig. [86]1h). When evaluating CNV gain and loss separately, we again observed significant differences in the length of MSCNV regions in a model-dependent manner (Supplementary fig. [87]3c and Supplementary data [88]4). Myc and C3Tag MSCNV regions of gain were significantly longer than regions of loss, while the opposite was observed for PyMT, and no difference was observed for Her2. This data reveals an oncogenic driver-dependent influence on MSCNV number and size, thus impacting the number of genes with Lung met-specific CNV (Fig. [89]1i). Genetic background alters CNV event number and size We next performed similar analyses on samples from our PyMT-F1 cohort to evaluate the contribution of genetic background to CNV when the oncogenic driver (PyMT) was held constant (Fig. [90]1a and Supplementary data [91]1). While the percentage of the genome impacted by CNV was similar in Lung met tissue across strains (53–73%), significantly less was observed in PTs from BALB and AKR F1 mice, at only 18.24% and 33.89%, respectively (Fig. [92]1j, Supplementary fig. [93]4a and Supplementary data [94]2). More specifically, BALB and AKR had fewer regions of CNV within PT gDNA compared to matched Lung met tissue (Fig. [95]1k). Interestingly, when comparing the number of regions of Lung met CNV between all F1s we noted a divergence where gDNA from BL6, BL10, MOLF, and CAST tissues had between 600 and 900 regions, but gDNA from BALB, SEAGN and AKR had 3000–4000 regions (Fig. [96]1k). Due to the heterozygous genetic background of the F1 PyMT mice, we were also able to assess regions of allelic imbalance. While gDNA from most strains had a similar number of regions with allelic imbalance in PT and Lung mets (18–40), CAST and MOLF, closely related wild-derived strains, had a much higher number (65–128) (Supplementary figs. [97]3d, [98]5 and Supplementary data [99]2). When measuring CNV region size in tissue from PyMT F1s we observed an inverse relationship with region number. BL6, BL10, MOLF, and CAST F1 mice had larger regions of CNV than SEAGN, BALB, and AKR (Fig. [100]1l and Supplementary data [101]3). We theorized that this inverse relationship was likely responsible for the relatively constant percentage of the genome impacted by CNV, excluding BALB and AKR PTs. In addition to differences between strains, we also observed small but significant differences in the length of CNVs in Lung mets compared to PTs in CAST and BALB F1s (Fig. [102]1l). Analysis of region size was further broken down to genomic gains, losses, and allelic imbalance (Supplementary fig. [103]3e, f). Regions of genomic loss trended larger than those of gain in both PT and Lung mets, regardless of background (Supplementary fig. [104]32f). Further, in some F1 strains, the regions of allelic imbalance in PT and Lung met gDNA were significantly larger than the average CNV size. In strains with larger allelic imbalance, the size was roughly double the length of average CNV (Supplementary fig. [105]3e, f and Supplementary data [106]3). Taken together, PyMT F1 models have unique CNV fingerprints shaped by CNV region number and region size, therefore altering the number of genes impacted by CNV (Fig. [107]1m). Genetic background alters metastasis-specific CNV event number and size We next focused on MSCNV among the PyMT-F1 cohort. First, the closely related BL6 and BL10 strain pair had the fewest number of MSCNV, with only 50 and 39, respectively (Fig. [108]1n and Supplementary data [109]4). Closely related BALB, SEAGN, and AKR had the highest number of MSCNV, ranging from 2448 to 24,650. Though both MOLF and CAST are wild-derived strains, they exhibited vastly different numbers of MSCNV regions, at 1201 and 26, respectively. Interestingly, the number of regions of allelic imbalance within MSCNV did not appear to correlate with strain-relatedness and ranged from a single region observed in BL10 F1s to 121 regions in AKR F1s (Supplementary fig. [110]3g). As we observed with the oncogenic driver models (Fig. [111]1g), the number of genes impacted by MSCNV correlated closely with the number of regions, rather than the size of CNV region. BL6, BL10, and CAST had the smallest number of genes impacted by CNV, ranging from 139 to 264 (Fig. [112]1o and Supplementary data [113]5). BALB, SEAGN, and AKR had the highest number of genes contained within MSCNV, ranging from 2820 to 24,615 (Fig. [114]1n). Further, the number of genes altered by CNV appeared to be evenly distributed between regions of gain and loss. Finally, the number of genes found within regions of metastasis-specific allelic imbalance also correlated with the number of regions rather than region size (Supplementary fig. [115]3g-i). MSCNV region size again followed an inverse relationship with region number. SEAGN, BALB, and AKR had the smallest regions of MSCNV, ranging from 49 to 83 kb, whereas CAST had the largest average size of 1250 kb (Fig. [116]1p and Supplementary data [117]4). Comparison of MSCNV length between F1 strains revealed that each strain had a statistically different MSCNV length from all other F1s, and only BL6 and BL10 were not statistically different from one another. When region size for MSCNV was further broken down to gDNA gain, loss, and allelic imbalance, we noted once again that regions of allelic imbalance were larger than gain or loss for six of the seven F1 strains, a difference that was statistically significant in BALB and AKR (Supplementary fig. [118]3i). Metastasis-specific regions of gain and loss varied greatly in size across F1 models, but there was little difference in size when comparing regions of loss and gain within the same F1 strain (Supplementary fig. [119]3i). Together, this data reveals that metastasis-specific CNV follows a linear trend between CNV region number and region length, as well as between region number and the number of genes with CNV, which is modulated in a strain-dependent manner (Fig. [120]1q). Genetic background shapes gene set enrichment in CNV We next determined if specific genes or pathways were commonly enriched in MSCNV. For the FVB oncogenic driver models, 1676 genes were commonly found within regions of MSCNV across all four models, which accounted for 10-56% of the overall number of genes within MSCNV in each model (Fig. [121]2a and Supplementary fig. [122]6a). Gene ontology (GO) analysis of this common gene set revealed enrichment of cell adhesion, granzyme-mediated cytolysis, and T-cell viral response pathways (Supplementary fig. [123]6b and Supplementary data [124]6). Comparison of GO analysis performed with the MSCNV gene sets from each FVB model individually revealed 25 pathways commonly impacted by MSCNV: 11 immune-related, four metabolic, two developmental, and eight uncategorized (Fig. [125]2b, Supplementary fig. [126]6c and Supplementary data [127]7). Notably, one of the uncategorized pathways was Cellular response to nicotine, which we previously identified in a prior large transcriptomic study and showed that exposure to nicotine significantly increased mammary cancer metastasis in a mouse allograft model^[128]11. Fig. 2. FVB driver models have metastasis-specific regions and gene families with CNV. [129]Fig. 2 [130]Open in a new tab a Histogram showing the percent overlap in MSCNV gene lists from each model. Numbers within bars indicate gene number common to all (purple), shared among some (yellow), or unique to each model (green). b DAVID analysis clustering of 1676 MSCNV genes common to all models with enrichment in CNV gain (blue) or loss (red) indicated by −Log10(p value) (dotted line indicates cutoff for significance (≥1.3). Heatmap indicating % lung met-specific gain in blue (% PT gain − % LM gain) and lung met-specific loss in red (% PT loss − % LM loss) for genes within a single genomic region from the c Adaptive immune response, d immune response, and e Granzyme mediated programmed cell death signaling pathway. CNV signal traces from PT and lung mets aligned using Nexus 10 genomic view and the Myc model data set. Interestingly, several of the immune-related pathways contained CNV-altered gene families clustered within a single locus. Furthermore, pathways enriched by CNV had both significant metastasis-specific gain and loss (Fig. [131]2b). A 400 kb region of chromosome 4 was altered in six of the 11 immune-related pathways common to the oncogenic driver cohort MSCNV. Within this region, metastatic samples had genomic gain flanked on either side by genomic loss, but the inverse was observed in PTs (Fig. [132]2c). This locus contains a cluster of 23 interferon genes similar to a region housing several ortholog interferon genes on human chromosome 9 p21.3. Additionally, we identified a 3000 kb region on chromosome 17 with both metastasis-specific gain and loss. Within this locus are the 32 H2 genes that make up the major histocompatibility complex (MHC). Specifically, we observed a common amplification of H2-Q genes and deletion of H2-M genes in metastases and the inverse in PTs (Fig. [133]2d). A third region of interest was observed on chromosome 14 that stretched approximately 150 kb and included seven granzyme (Gzm) genes that can be found within two of the common immune pathways. We observed significant Lung met-specific deletion of this locus across all four FVB driver models (Fig. [134]2e). While several more loci were identified as major contributors to each enriched gene ontology, we noted that PyMT and Myc often aligned, but C3Tag and Her2 samples were more variable (Supplementary data [135]8). We performed a similar analysis for the PyMT-F1 cohort. The 25 common pathways identified from the FVB oncogenic driver model analysis were compared to GO terms enriched in MSCNV from each F1 model individually using several two-group Venn diagram analyses. AKR had the highest overlap of 17 pathways, followed by BALB (6), SEAGN (2), and MOLF (1), while BL6, BL10, and CAST had no overlap (data not shown), suggesting that genes within regions of MSCNV are enriched in a genetic background-dependent manner, as F1s with the highest GO overlap were also the strains most closely related to FVB. Additionally, comparative analysis of gene lists from regions of MSCNV revealed that, except for in BALB F1s, most genes were subject to CNV in more than one strain but no genes were commonly altered in all strains (Fig. [136]3a, Supplementary fig. [137]6d and Supplementary data [138]5). Similarly, comparison of gene ontologies enriched by CNV were largely unique to each PyMT-F1 strain and there were no GO terms enriched by MSCNV across all strains (Fig. [139]3b, Supplementary fig. [140]6e and Supplementary data [141]9). Fig. 3. PyMT F1 models have no common gene sets or ontologies enriched by MSCNV. [142]Fig. 3 [143]Open in a new tab Histograms of the percent overlap in gene lists (a) and GO terms (b) enriched by MSCNV in each model. Numbers within bars indicate gene number common to all (purple), shared among some (yellow), or unique to each model (green). Venn diagram analysis of gene sets (c) and GO terms (d) enriched by MSCNV in low metastatic efficiency strains. e Heatmap indicating % lung met-specific gain in blue (% PT gain − % LM gain) and lung met-specific loss in red (% PT loss − % LM loss) for genes with MSCNV in low metastatic efficiency strains. Venn diagram analysis of gene sets (f) and GO terms (g) enriched in MSCNV from high metastatic efficiency strains. h Heatmap indicating % lung met-specific gain in blue (% PT gain − % LM gain), lung met-specific loss in red (% PT loss − % LM loss), and lung met-specific allelic imbalance (Al. Imb.) in purple (% PT Al. Imb. − % LM Al. Imb.) for genes with MSCNV in high metastatic efficiency strains from the Cell adhesion pathway within a single locus. CNV signal traces from PT and Lung mets aligned using Nexus 10 genomic view of the BL10 and BALB model data sets. We previously reported differing metastatic efficiency of the mouse strains included in this study (Fig. [144]1a and Supplementary data [145]1)^[146]12. Therefore, we next asked if F1-PyMT metastases from strains with either high or low metastatic efficiency might have similar genomic evolution and common MSCNV. To test this, genes within CNV gain, loss, and allelic imbalance, as well as enriched GO terms were compared among strains with low (SEAGN, BL6, and MOLF) and high (BALB, AKR, BL10, and CAST) metastatic efficiency. For low metastatic efficiency strains, we identified four genes commonly altered by MSCNV, but no commonly enriched GO terms (Fig. [147]3c, d). The Chrm4 and Amrba1 genes are found within a region of metastasis-specific genomic gain on mouse chromosome 2 (human chr11 p11.2), whereas Vmn2r29 and Pbrm1 are found on mouse chromosome 7 and 14, respectively, and have MSCNV loss (Fig. [148]3e). In contrast, high metastatic strains revealed no common genes within regions of MSCNV, but all strains had enrichment of the Cell adhesion GO term (Fig. [149]3fg). Heatmap analysis showed high variability between strains for genomic gain or loss of factors within the Cell adhesion ontology gene list, except for a locus on mouse chr18 encompassing a stretch of 19 protocadherin genes (Fig. [150]3h and Supplementary data [151]4). BALB and AKR showed amplification of this region in metastatic tissue compared to PTs, whereas BL10 and CAST showed amplification within the PT DNA not seen in the metastases, therefore resulting in a comparative metastasis-specific loss in this region (Fig. [152]3h). As well as genes within regions of loss and gain, we assessed imbalance between the FVB and non-FVB allele induced by CNV and enriched in Lung mets. Allelic imbalance of a common gene or gene set was not observed across this cohort (Supplementary fig. [153]7a). Additionally, we did not observe allelic imbalance common to strains with either high or low metastatic efficiency (Supplementary fig. [154]7b, c). However, when we observed allelic imbalance within each closely related strain pair, common regions of imbalance were identified. In BALB and SEAGN PyMT F1 mice we observed common allelic imbalance in two regions, Chr1qH3 and Chr11qB4 (Supplementary fig. [155]7d). Interestingly, four genes found within a region of amplification on Chr1qH3 had an allelic imbalance in BALB Lung mets that was not observed in the PTs, but SEAGN showed the inverse with allelic imbalance in the PT not observed in Lung mets. The second region encompassed three genes with allelic imbalance in both BALB and SEAGN PTs that was not enriched in Lung mets (Supplementary fig. [156]7d). These seven genes are similarly arranged within the human genome at loci Chr1 q23.3 and Chr17 p13.2, respectively. MOLF and CAST PyMT F1s also had two common regions of metastasis-specific allelic imbalance. A region of PT and Lung met genomic gain on Chr7q resulted in allelic imbalance in Lung mets from both strains (Supplementary fig. [157]7e). This locus houses 11 genes and corresponds to the human Chr10q26 which contains the same set of ortholog genes. A second region within mouse Chr18q was also a site of genomic gain and common Lung met-specific allelic imbalance, encompassing 10 genes that are also found together within human Chr5q31.3 (Supplementary fig. [158]7e). The third strain pair, BL6 and BL10, both had metastasis-specific allelic imbalance, but no common regions could be identified (Supplementary fig. [159]7f). Taken together, this data suggests that genetic background is a stronger determinant than oncogenic driver in shaping the enrichment of CNV-altered genes in the metastatic genome. Metastatic gene expression is not shaped by CNV We next asked if genes in MSCNV regions were differentially expressed in the metastatic lesions by performing bulk RNA-sequencing (RNA-seq) of RNA isolated from matched PT and Lung met tissue collected from the oncogenic driver and PyMT-F1 cohorts. We found significant variation in the number of genes within regions of CNV that were expressed in metastatic tumors (Fig. [160]4a and Supplementary data [161]10). CAST had the lowest expression with only 8% of MSCNV genes found in the RNA-seq signal, but BALB had the highest with 58%. We next performed differential gene expression analysis between Lung mets and PTs for each model and asked if genes with MSCNV had altered expression compared to PTs (Supplementary data [162]11). Again, we observed high variability within the data, with the FVB-MYC model showing the lowest number of MSCNV genes with altered expression (0.01%), and PyMT F1-AKR showing the highest (33%). When we examined the genes within MSCNV that also had significant differential gene expression compared to the PT, we noted that the expression change did not correspond to the CNV type (gain or loss) (Fig. [163]4b, c). Taken together this shows a poor association between CNV type and direction of any changes in gene expression at the macro metastasis stage (Fig. [164]4d–n). Fig. 4. MSCNV does not significantly impact metastasis-specific gene expression. [165]Fig. 4 [166]Open in a new tab a Histogram of the percentage of genes with MSCNV that are not expressed (blue), expressed in lung mets with no change from PT (pink), and differentially expressed in lung mets compared to PTs (red). Numbers listed before each model name represent the MSCNV gene set size. Histograms showing the percentage of genes with MSCNV-gain (b) or MSCNV-loss (c) that had significant up regulation (green) or down regulation (yellow) in lung mets compared to PTs. Paired volcano plots for the MSGE of genes within regions of MSCNV gain or loss for each FVB driver model (d–g) and PyMT-F1 model (h–n). Oncogenic driver and genetic background shape metastatic gene expression As MSCNV was not a major contributor to metastasis-specific gene expression (MSGE) within metastatic lesions, we analyzed our RNA-seq data to determine how oncogenic driver or genetic background impacted MSGE. To begin, all samples were visualized together by unsupervised principal component analysis (PCA), which resulted in four distinct clusters differentiated by oncogenic driver (Fig. [167]5a). When the analysis was limited to individual models within the oncogenic driver cohort there was no significant separation of PT and Lung met samples (Fig. [168]5b–e). Together, these results indicate that oncogenic driver is a stronger determinant of sample similarity than tissue type or genetic background. Fig. 5. Metastatic gene expression is shaped by oncogenic driver and genetic background. [169]Fig. 5 [170]Open in a new tab a Unsupervised PCA of RNA-seq data from PT and Lung met tissues across all FVB driver and PyMT-F1 mouse models. Unsupervised PCA of individual PT and lung met samples from FVB Her2 (b), FVB C3Tag (c), FVB Myc (d), and FVB PyMT (e). f Histogram of the number of alternatively expressed genes in lung mets vs. PT for each FVB driver model (FDR < 0.05). Volcano plots with significantly up regulated genes in green and down regulated genes in yellow for MSGE in FVB Her2 (g), FVB C3Tag (h), FVB Myc (i), and FVB PyMT (j). k Histogram showing the enrichment score for commonly enriched pathways from MSGE of the C3Tag, Her2, and PyMT models. l Histogram showing the fold change of genes common to pathways enriched by MSGE. Unsupervised PCA of individual PT and lung met samples from BL6 (m), MOLF (n), CAST (o), BALB (p), SEAGN (q), and BL10 (r). s Volcano plots with significantly up regulated genes in green and down regulated genes in yellow for MSGE in the PyMT-F1 models. t Histogram of the number of alternatively expressed genes in lung mets vs. PT for each PyMT-F1 model (FDR < 0.05). u Histogram showing the enrichment score for commonly enriched pathways from MSGE of the F1-PyMT models. Histograms showing the fold change of common MSGE enriched in the Focal Adhesion (v) and Renin Secretion pathways (w). Gene expression analysis comparing Lung met and PTs revealed that the extent to which MSGE differs from the primary tumor is also oncogenic driver-dependent. Using a cut-off of p < 0.05 and FDR < 0.1, the PyMT model had significant changes in the expression of 561 genes, C3Tag 113, Her2 64, and Myc only 4 genes (Fig. [171]5f and Supplementary data [172]11). While the Her2 and Myc models showed an even distribution of up- and down-regulated genes, the PyMT and C3Tag models favored up-regulation of metastasis-specific genes (Fig. [173]5g–j). Alternatively, regulated genes in the C3Tag and Her2 models overlapped with those from PyMT, but only a single gene, Pcdh17, was commonly altered in metastatic tissue from all four models (Supplementary fig. [174]8a, b). According to The Genotype-Tissue Expression (GTEx) project, PCDH17 is highly and specifically expressed in Lung tissue and therefore may simply represent the stroma in which the tumor cells reside (Supplementary fig. [175]8c). However, this is challenged by the observation that Pcdh17 is also expressed in PT tissue in a genetic background-dependent manner in the PyMT-F1 cohort (Supplementary fig. [176]8d). Based on these data, we conclude that the MSGE of these models is largely oncogenic driver-dependent. Similarly, GO analysis revealed that there were no Lung met-specific gene expression pathways common across all oncogenic driver models (Supplementary fig. [177]8e and Supplementary data [178]12). However, it was noted that the Myc model MSGE only enriched a single pathway, the Wnt Signaling pathway. Three pathways, Complement and coagulation cascades, Platelet activation, and Malaria, were significantly enriched and shared by the Her2, C3Tag, and PyMT models (Fig. [179]5k). Interestingly, while these pathways are related by their involvement with red blood cell biology, the gene lists that define them overlap less than 7% (Supplementary fig. [180]8f). Upon assessment of the genes within each pathway, we noted that Her2, C3Tag, and PyMT enrich largely independent gene sets except for the cell adhesion molecules Icam1, Vwf, and Itgb2 (Fig. [181]5l and Supplementary data [182]13). RNA-seq data was next analyzed for metastasis-specific changes in transcript levels to assess model–specific splicing. Upon comparison of Lung mets and PTs from the oncogenic driver cohort, we identified a single transcript significantly altered in metastases compared to PTs and common to all models (Supplementary fig. [183]9a and Supplementary data [184]14). Rnase4 isoform 2, and not isoform 1, was significantly increased in metastatic tissue compared to PT (Supplementary fig. [185]9b, c). Rnase4 isoform 1 and 2 share the same promoter and coding exons but have unique 5’ untranslated regions (UTRs)^[186]24,[187]25. We noted that overall expression of Rnase4 remained unchanged, which may be explained by a compensatory, although non-significant, decrease of isoform 1 (Supplementary fig. [188]9b). We asked if Rnase4 might also be alternatively spliced in lung metastases across the PyMT-F1 cohort. A statistically significant increase in levels of isoform 2 was observed in BALB, SEAGN, and BL6, but not any of the remaining strains (Supplementary fig. [189]9d). Comparative analysis of splicing across the PyMT-F1 cohort revealed no commonly altered transcripts, suggesting that metastasis-specific splicing may be determined in-part by genetic background (Supplementary fig. [190]9e). We next asked how genetic background could shape gene expression programs in macro-metastases. Upon visualization of PT and Lung met gene expression by unsupervised PCA for each PyMT-F1 strain, we observed no distinct separation between tissue types for BL6, MOLF, and CAST (Fig. [191]5m–o), but distinct clustering of samples by tissue in BALB, SEAGN, and BL10 (Fig. [192]5p–r). The heterozygous PyMT F1s had a higher degree of MSGE compared to homozygous FVB PyMT samples, with the number of altered genes ranging between 814 and 3226 for BL6, BL10, SEAGn, MOLF, and CAST (Fig. [193]5s, t and Supplementary data [194]11). Comparison of the MSGE between the F1s using a six-part Venn diagram analysis showed that close genetic background did not always correspond with highest overlap in MSGE (Supplementary fig. [195]10a). MOLF, BL6, and BALB MSGE were most similar to their corresponding strain pairs, CAST, BL10, and SEAGN, respectively, with about 50% of the MSGE in these strains overlapping with the MSGE of their closely related counterparts. However, SEAGN had the highest overlap in MSGE with CAST, while CAST and BL10 strains had a higher percentage of overlap with one another. Furthermore, we noted that MSGE was also highly individualized between F1 strains, with unique changes in gene expression accounting for 23–58% of MSGE (Supplementary fig. [196]10a). Only 60 genes were common to MSGE of all strains (1.9–13.6% of total MSGE for each F1 strain), and these were enriched for extracellular matrix, cell adhesion, autophagy factors, and glycoproteins as determined by DAVID analysis (Supplementary data [197]15)^[198]26,[199]27. Further GO analysis was performed using the MSGE for each PyMT-F1 strain (Supplementary data [200]16). Venn diagram analysis of the enriched pathways revealed ECM-receptor interaction, Gap junction, Focal adhesion, and Renin Secretion as four pathways commonly enriched by MSGE across all PyMT-F1s (Fig. [201]5u). The genes included in the KEGG ECM-receptor interaction and Gap junction pathways overlap with the Focal adhesion pathway by 74% and 30%, respectively, but not one another (Supplementary fig. [202]10b). As such, when we compared the MSGE of each pathway between strains we identified four commonly altered genes (Thbs2, Col1a2, Vwf, and Egflam) from the Focal Adhesion pathway that were also found within either the ECM receptor interaction or Gap junction pathways (Fig. [203]5v and Supplementary data [204]17). Furthermore, three genes (Adrb2, Npr1, and Acer2) from the Renin Secretion pathway were upregulated in the MSGE of all F1 strains (Fig. [205]5w). Based on GTex project data, none of these commonly altered genes are Lung-specific, all are expressed in breast tissue, and some are ubiquitously across many tissues (Supplementary fig. [206]10c). In summary, while the same pathways were enriched by all strains, each F1 appears to modulate gene expression differently to achieve enrichment of these pathways. MSCNV and MSGE from both cohorts target common cellular processes Analyses conducted in this study sought to identify commonalities between models. However, in many instances, particularly among PyMT-F1s, the models were more unique than similar. Indeed, we noted that the oncogenic driver models had 38–72% of MSCNV enriched pathways that were unique to each model, and the altered genetic background of the PyMT-F1 cohort resulted in no commonly enriched genes or pathways from MSCNV, but rather 60–88% unique pathways in each F1 strain (Fig. [207]2a and Fig. [208]3a). To visualize and summarize the differences between models the top 70 GO terms from MSCNV gain and loss were sorted into thirteen categories and the percent of each category within the top pathways presented as heatmaps (Fig. [209]6a and Supplementary data [210]7, [211]S9). Immune (cyan) and metabolic (pink) pathways were dominant across all four oncogenic driver models, in both MSCNV gain and loss. However, each model displayed a unique distribution of pathways, and MSCNV gain and loss were also distinct within the same model, which was most pronounced in the C3Tag model (Fig. [212]6a). MSCNV loss among the PyMT-F1 BALB, AKR, and SEAGN models most closely resembled the distribution observed in the oncogenic driver models, with immune and metabolic pathways dominant, but developmental pathways (gray) were the most significantly enriched across this cohort. From this analysis, we conclude that despite exploiting the same oncogenic driver, genetic background has a stronger influence on the genes enriched within MSCNV (Fig. [213]6a). Fig. 6. Oncogenic driver and genetic background shape unique MSCNV and MSGE to target common cellular processes. Fig. 6 [214]Open in a new tab Categorization of the top 70 (or otherwise indicated) significantly enriched gene ontologies from MSCNV (a), MSGE (b), and lung met-specific alternative splicing (c). Uncategorized ontologies were excluded from the heat maps. Color intensity represents the percent of the analyzed pathways that fell within that category. We next extended this analysis to visualize differences in MSGE from RNA-seq data (Fig. [215]6b). We again observed an enrichment of immune pathways across the oncogenic driver models, except for in the Myc model, which had limited MSGE and only a single enriched pathway (Supplementary fig. [216]8a and Supplementary data [217]12, [218]S16). For the PyMT-F1 cohort, we observed more similarity in metastasis-specific gene expression pathways across strains, compared to the wide variety enriched by CNV (Fig. [219]6a, b). We noted that despite seeing no common developmental pathways, this category made up the majority of MSGE-enriched ontologies among five of the models. In addition, ECM, migration, and proliferative pathways were enriched to varying degrees in the MSGE of each PyMT-F1 strain. Interestingly, the MSGE in the PyMT-F1 cohort represented fewer categories compared to the MSCNV-enriched pathways. Overall, due to the low number of genes and pathways common to MSGE across all PyMT-F1 strains, we conclude that while genetic background can have a profound impact on gene expression within metastatic lesions, these changes may ultimately impact the same cellular processes. This analysis demonstrated that a more diverse genetic background, modeled by the heterozygous PyMT-F1 cohort, shifted MSGE to more developmental and less immune-related ontologies. However, Extracellular matrix (green) genes appear to be commonly modulated by MSGE across all models. Finally, we assessed pathways enriched by metastasis-specific splicing. In both the oncogenic driver and PyMT-F1 cohorts, we noted that fewer pathways were significantly enriched by alternatively spliced genes compared to MSCNV and MSGE (Fig. [220]6c and Supplementary data [221]18). Despite there being no common alternatively spliced transcripts across the eleven models used in this study, we noted that metabolic and RNA processing (purple) ontologies were dominant in all models. However, each model had a unique distribution of pathways significantly enriched by metastasis-specific splicing programs. Discussion Metastasis is a complex phenotype, driven by the interplay of tumor cell intrinsic, microenvironmental, and systemic factors^[222]28. Classic cytogenetics studies and more recent sequencing-based work has shown that metastases usually arise from a single, or at most a limited number, of subclones within the primary tumor^[223]29. In addition, clinical and experimental data suggest that only a small fraction of disseminating tumor cells are able to form secondary lesions^[224]30. These and other data have been interpreted to suggest that tumors evolve over time until a subset of tumor cells acquire all of the necessary attributes to successfully complete the metastatic cascade. To date, however, in contrast to what has been observed in primary tumors, sequencing studies in clinical samples have not revealed the presence of high frequency, reproducible SNVs associated with metastatic progression^[225]31–[226]33. In contrast, previous studies from our laboratory, comparing matched primary and metastatic lesions from the highly metastatic MMTV-PyMT mouse model of breast cancer, did identify reproducible metastasis-associated point mutations in Shc1 and Kras. These studies, however, were performed on a much more genetically homogeneous set of animals than is observed in the human population, suggesting that these potential metastasis driver mutations might only represent a small fraction of the patient population. In addition, these observations were made for a single oncogenic driver model, PyMT, rather than the multiple oncogenic initiating events observed for human patients. To better understand how genomic evolution may contribute to metastatic progression in commonly used GEMMs and to compare them to the expanding clinical data, we have extended our previous work to include two additional models of breast cancer metastasis, the C3(1)TAg and MMTV-Myc models, and expanded on the previously analyzed MMTV-Her2 model. Moreover, for the MMTV-PyMT model, we have expanded the repertoire to include pairs of phylogenetically related strains with significantly different inherited propensity to metastasize to include population diversity into the analysis. The total data set analyzed in this study consists of 197 matched primary and metastatic lesions spanning models of luminal, Her2, and basal-like breast cancer. Consistent with our previous study, analysis of the expanded data set failed to reveal high frequency SNVs associated with metastasis across all of the models. However, a total of 13 genes were identified that were mutated in 2 or more independent animals, The most frequently observed gene was Kras, which had metastasis-associated activating mutations in 5 out of 197 animals. All of the remaining genes with metastasis-associated SNVs were present in only 2 or 3 animals. Reproducible metastasis-associated SNVs were not observed in 84% (166/197) of the animals. This suggests that SNVs do not play a significant role in metastatic progression. A similar result was observed for synchronous lymph node metastasis in human patients^[227]34, though metastasis-specific SNVs were observed in the metachronous distant metastases. However, since the metachronous lesions were collected after anthracycline treatment, it is unclear whether these SNVs represent metastasis drivers or therapeutic response. A second study by Siegel et al. ^[228]35 also concluded that the majority of metastasis drivers were copy number alterations, rather than SNVs. Furthermore, Hoadley et al. ^[229]36 also found few metastasis-specific mutations, which were rarely expressed, and concluded that these SNVs were more likely passengers than drivers, consistent with our results. However, our ability to identify low frequency metastasis-associated SNVs in mouse models, compared to human clinical studies, suggests that there may be either rare SNVs in particular patient populations, or many SNVs that contribute at a low frequency to metastatic progression, which are more readily apparent through sequencing of multiple genetically identical independent sample pairs. Deeper sequencing and significantly increasing the number of samples per group may enable the identification of additional low frequency metastasis drivers, particularly those that might be selectively lost at the secondary site due to clonal selection. The mouse data generated here, however, is consistent with current clinical data suggesting that there is a lack of high frequency activating mutations for precision medicine for anti-metastatic therapies that would be of benefit for significant fractions of breast cancer patients. We subsequently performed CNV analysis on the data to determine whether there were common CNVs that might contribute to metastatic progression. Analysis revealed that oncogenic drivers on the FVB homozygous genetic background had similar degrees of CNV in both primary and secondary lesions, though there were differences in the size of the CNVs between models. Further, in both PTs and lung metastases we observed some strains with high numbers of small events and others with small numbers of larger events, accounting for the similar degree of copy number change overall. Notably, the size of copy number events in this study are within the range observed across cancer types by Steele et al. in a recently published analysis of copy number signatures from clinical data^[230]37. The major determinant of CNV variation, however, was genetic background. This was most apparent in the comparison of the phylogenetically related BALB and SEAGn strains. The SEAGn strain was generated by a cross between BALB and PJ mice, followed by 20 generations of brother-sister mating^[231]38. This congenic strain genome is 50% BALB yet shows significantly less genomic stability compared to the BALB primary tumor. Together this data suggests that genetic background may direct the mechanisms leading to copy number events. Further analysis focused on metastasis-associated events revealed common CNVs, associated genes, and gene ontologies across the oncogene driver models on the FVB homozygous background. However, these common metastasis-associated features are lost when the genetic background was varied in the MMTV-PyMT cohort. This suggests that while the analysis of the four homozygous FVB oncogenic models may be valuable for identification of some common metastatic mechanisms, they are unlikely to represent the majority of the human patient population. Moreover, consistent with our previous observations of inherited metastatic susceptibility, this suggests that an individual’s personal ancestry is a major factor in metastatic genome evolution. In addition, these data suggest that there are likely multiple paths that tumor cells utilize to achieve metastatic competency, which indicates that multiple strategies to prevent or treat established lesions may be required across the human population. Similar to the CNV analysis, transcriptional analysis demonstrated that the most significant determinants of gene expression were oncogenic driver and genetic background. Surprisingly, CNV status did not show strong correlation with metastasis-specific gene expression when analyzing macroscopic metastatic lesions. One possible explanation for this is that alterations in gene expression due to CNV are important during the earliest stages of dissemination, extravasation and adaptation to the secondary site. Once resident in the secondary site epigenetic regulation mediated by microenvironmental influences of the pre-metastatic niche and metastatic tumor bed may reprogram and dominate the transcriptional program of the macroscopic metastatic tumors. Additional investigations into the epigenetic state of the matched primary and metastatic tumors in these models, however, will be necessary to address this possibility. Examination of the data at a broader level provides cause for optimism. Although individual Gene Ontology terms from CNV and transcriptome analysis were not shared across the models and genotypes, grouping individual ontologies into broader categories demonstrates potential commonalities that might be targeted. When considered as broad categories, developmental pathways, ECM, RNA biology, and metabolism appear to be common metastatic categories across all of the animals, although different Gene Ontologies are affected in each of the cohorts (Fig. [232]7). Each of these broad categories have already been previously associated with metastatic disease. Importantly, a recent sequencing study also identified metabolism and RNA biology ontologies distinguishing matched primary and metastatic breast cancer samples^[233]35, providing greater confidence that the results reported here are translationally relevant. The data presented here suggests that to generate therapies that have broad applicability across patient populations more in-depth analysis of these categories may be necessary. In particular, since the models used here appear to rely on different pathways within each of the broad ontology categories, targeting any single pathway is unlikely to be efficacious in a large percentage of the patient population. Further investigations into pathway intersections within the broad metastasis categories, however, might identify critical convergence points that might be targeted in either single or combination therapies to suppress metastasis establishment or metastatic lesion viability. Fig. 7. Host genetic background shapes metastatic SNVs, CNVs, and gene expression more than oncogenic driver. [234]Fig. 7 [235]Open in a new tab Schema summarizing a murine model genetic background and oncogenic driver and the unique (black diamonds) or common (white boxes and solid lines) metastasis specific b genomic alterations and c gene expression. This study has focused on the characterization of the genomic evolution of commonly used mouse models of breast cancer progression and metastasis. A major strength of this study is the ability to generate multiple genetically identical animals for each model and genotype, enabling more accurate dissection of the role of oncogenic driver and genetic background on tumor evolution and transcriptional output. Limitations of this study include potential biological differences between the two species. In addition, many of the features observed in these models recapitulate events observed in human cancer progression. For example, human tumors also experience significant copy number alterations, with up to 25% of the genome of all human tumors acquiring somatic amplifications of deletions. Moreover, like the mouse tumors examined here, human tumors tend to accumulate a higher number of copy-number alterations than single nucleotide mutations^[236]39. Furthermore, human circulating tumor cells often have copy-number alterations that more closely resemble metastatic lesions rather than the primary tumors^[237]40, suggesting that copy-number alteration provides a selective advantage for metastatic progression. Increased copy-number alteration has also been associated with increased probability of metastasis and mortality in prostate cancer^[238]41, consistent with the data presented here. Limitations of this study, however, include potential differences between the two species. Mouse and human immune and metabolic systems are known to have differences, which may potentially contribute to differences in the important tumor-stromal cell interaction during metastatic dissemination and colonization. In addition, the mechanism of copy-number alteration may differ between mouse and human. Whole genome duplication is a frequent event in human metastatic tumors^[239]23. However, analysis of the primary and metastatic lesions in the mouse models suggests that these tumors had not experienced whole genome duplication. In addition, although the use of multiple inbred strains has introduced genetic diversity for the MMTV-PyMT model, the remaining three models studied here were limited to the FVB/N homozygous background, which precluded the ability to investigate the role of genetic diversity in these models. In addition, the metastases examined here were limited to pulmonary lesions and for most animals limited to a single metastatic lesion per animal. This precluded investigations into the potential tissue-specific differences as well as performing phylogenetic tracing analysis. Despite these limitations, here we have demonstrated that (1) Genetic background plays a role in primary tumor stability and the size of copy number event regions, (2) Genetic background is a strong determinant of which genes are specifically enriched in CNV, (3) Oncogenic driver and genetic background shapes gene expression in metastatic tumors, and (4) Metastasis-specific ontology clusters, rather than individual Gene Ontologies, identify common biological programs in metastatic disease that might be considered for clinical intervention (Fig. [240]7). An increased understanding of the complex interplay between these factors will provide guidance for the improved use and interpretation of these models in relation to human patient populations for the investigation of progressive metastatic disease. Methods Ethics statement We have complied with all relevant ethical regulations for animal use. The research described in this study was performed under the Animal Study Protocol LPG-002, approved by the National Cancer Institute (NCI) Animal Use and Care Committee. Animal euthanasia was performed by cervical dislocation after anesthesia by Avertin. Genetically engineered mouse models and tissue isolation FVB/N-Tg(MMTV-PyVT)634Mul/J (PyMT), FVB/N-Tg(MMTVneu)202Mul/J (Her2) male mice were obtained from The Jackson Laboratory. FVB/N-Tg(C3(1)-TAg) (C3-TAG) and FVB/N-Tg(MMTV-Myc) (MYC) male mice were a generous gift from Dr. Jeffrey Green (NCI, Bethesda, MD). Male PyMT mice were crossed with female wild-type FVB/NJ, MOLF/EiJ, CAST/EiJ, C57BL/6J, C57BL10/J, BALB/cJ, Sea/GnJ, and AKR/J mice, which were also obtained from The Jackson Laboratory. Male Her2, C3-TAG, and MYC mice were crossed with female wild-type FVB/NJ mice. All female F1 progeny were genotyped by the Laboratory of Cancer Biology and Genetics genotype core for the PyMT, Her2, Myc, or C3-TAG gene. Transgene-positive female animals were housed in standard isolator cages and provided food and water ad libitum. PyMT-positive female animals were aged to permit tumor initiation and growth until animals reached humane endpoints or were expected to reach humane endpoints by the next daily monitoring. For the MMTV-Myc, MMTV-Her2 and C3(1)-TAg models, tumors from animals with single lesions were surgically resected at or before tumors reached 1 cm in any direction. Animals were then aged an additional 60 days or until they approached humane endpoints, whichever came first. All animals were housed in the same room in the building 37 vivarium. Mice were euthanized using intraperitoneal Avertin to anesthetize followed by cervical dislocation. All primary tumors generated by one animal were isolated, weighed, randomly sampled, and combined into a single cryovial. Metastatic nodules and normal (tail) tissue were also isolated immediately following euthanasia, placed in cryovials, and snap frozen in liquid nitrogen. Tissue samples were then stored at −80 °C. Humane endpoints included hunched posture, rough hair coat; dehydration (reduced skin turgor, sunken eyes); rapid or labored breathing, dyspnea, coughing; reduced/impaired mobility affecting the ability to obtain food or water; pallor or cyanosis (evident in the general appearance of the feet, muzzle or ears which may turn white or blue); hemorrhage or bleeding from any orifice; diarrhea, constipation or markedly reduced food intake; signs of neurological impairment for example seizures, paralysis, circling, head tilt; impaired ability to urinate or defecate; visible jaundice (yellow discoloration to ears, muzzle, or feet); loss of >15% normal body weight. Additional humane endpoints included tumors interfering with normal movement that impaired their ability to perform bodily functions, significantly impaired gait, or impaired their ability to obtain food or water; tumors interfering with normal bodily functions such as urination, defecation, breathing, chewing, or swallowing; or dyspnea at cage side observation. gDNA and RNA isolation The combined primary tumor tissue from one mouse was ground on dry ice and small fragments were taken for nucleotide extraction isolation. Whole metastases were used for gDNA or RNA isolation. For gDNA isolation, tissue was lysed using Tail Lysis Buffer (100 mM Tris-HCl pH 8.0, 5 mM EDTA, 0.2% SDS, 200 mM NaCl, 0.4 mg/ml proteinase K) at 55 °C overnight. Samples were then placed in a shaking (1400 rpm) heat block for 1 h at 55 °C. RNaseA (Thermo Fisher Scientific) was added (2 mg/ml final) and lysates were incubated on the bench for 2 min. gDNA was then isolated using the ZR-Duet DNA/RNA MiniPrep kit (Zymo Research). For RNA isolation, the tissue was mechanically dissociated using a tissue grinder while submerged in 1 ml of TriPure (Roche). 200 μl of chloroform (Sigma-Aldrich) was then added and the soluble fraction was isolated by centrifugation at 12,000 rpm for 15 min at 4 °C. RNA was then precipitated with the addition of 500 μl isopropanol and incubation of the sample at −20 °C for 2 h. Pure RNA was then extracted using the RNA: DNA mini-prep kit (Zymogen) and samples were eluted in 100 μl (PT) or 50 μl (metastases) of DEPC water (Quality Biological). RNA was isolated from cell lines using TriPure as described above, but following isopropanol precipitation, RNA was washed with 75% ethanol (Sigma-Aldrich) followed by 95% ethanol before being resuspended in 100 μl of DEPC water. All samples that passed initial quality control were used for sequencing analysis. No samples were excluded for any reason other than low yield or poor quality. No randomization was performed. Sequencing and analysis All analyses were carried out on the NIH Biowulf2 high performance computing environment. All analyses were performed using software default parameters if not otherwise specified. Exome sequencing and data analysis Exome sequencing was performed by the NCI Center for Cancer Research (CCR) Genomics Core and the NCI Illumina Sequencing Core. Exome libraries were prepared by the Genomics Lab using Agilent SureSelectXT Mouse All Exon target enrichment kit. Libraries were barcoded and pooled before sequencing on an Illumina HiSeq3000 or HiSeq4000 to an average depth of 40×. Samples were trimmed of adapters using Trimmomatic software. Bam files were uploaded into Nexus Copy Number 10.0 (BioDiscovery) and processed using SNP-FASST2 Segmentation. The trimmed reads from tumor gDNA were aligned to normal tail gDNA sequence from each model, and then to the mouse NCBI build 38 (mm10) reference genome. RNA sequencing and data analysis RNA quality was tested using the Agilent 2200 TapeStation electrophoresis system, and samples with an RNA integrity number (RIN) score >7 were sent to the Sequencing Facility at Frederick National Laboratory. Preparation of mRNA libraries and mRNA sequencing was performed by the Sequencing Facility using the HiSeq2500 instrument with Illumina TruSeq v4 chemistry. RNA-seq data were mapped to mm10 and analyzed using Partek Flow software. Single nucleotide variant analysis This analysis was performed as described in our previously published work^[241]6. Copy number analysis Following initial data processing, the Total copy number events and % of the genome changed by copy number for each sample were downloaded from the Data Set tab. Aggregate data from PTs and lung mets was attained by selecting only the sample from the desired tissue type within the Data set tab, and then selecting View and then Aggregate. Aggregate regions of CNV were defined as having a p value cutoff below 0.05 and an aggregate percent frequency of 35% and above. Region length, event type, number of genes, and gene names were exported as .txt from aggregate analysis. Average region length, total number of events and total number of genes were calculated using Excel (Microsoft). For lung met-specific CNV analysis, the Classic comparison tool was utilized in Nexus. In the Data Set tab, all samples from one mouse model were selected before navigation to the Comparisons tab. From here, a comparison was added using Tissue as the distinguishing factor, and Average was selected as the comparison baseline. Upon selection of the comparison and then View, lung met specific event type, region length, p value, number of genes, and gene names were exported as .txt files. Average region length, total number of events, and total number of genes were calculated using Excel. By highlighting specific CNV events within the comparison View tab, a pathway enrichment analysis was performed for the genes found within regions of CN Gain or CN Loss. A list of enriched Biological Processes was exported as a .txt file including p values. The percent CN gain, loss, or allelic imbalance in the primary tumors and lung mets for each gene within each pathway was obtained by selecting the individual biological process name. Calculation of % Lung met gain or loss − % PT gain or loss was performed in Excel and heat maps were created using the Excel conditional formatting tool. ABSOLUTE whole genome duplication analysis Reference files were trimmed using the Trim galore^[242]42, before being aligned to the mm10 genome using the Bwa-mem2 package^[243]42,[244]43. We then used Gatk to sort, deduplicate, and recalibrate the resulting files^[245]44. Using Gatk’s Mutect2 somatic variant caller, SNPs were called for all tumor-normal pairs. The R packages copywriter and DNAcopy were used to detect Copy Number Aberrations within tumor samples and format the data for integration with our SNP data^[246]45,[247]46. We used the R package BubbleTree to integrate CNA and SNP information to analyze tumor purity and ploidy^[248]47. RNA sequencing and data analysis RNA quality was tested using the Agilent 2200 TapeStation electrophoresis system, and samples with an RNA integrity number (RIN) score >7 were sent to the Sequencing Facility at Frederick National Laboratory. Preparation of mRNA libraries and mRNA sequencing was performed by the Sequencing Facility using the HiSeq2500 instrument with Illumina TruSeq v4 chemistry. RNA-seq data was analyzed using Partek Flow software (Kanehisa Laboratories). RNA-seq reads were uploaded into Partek Flow, aligned with the mouse mm9 genome assembly, gene counts were then determined, and then normalized. Principal Component Analysis was performed using unsupervised principal component analysis for either all samples, or samples filtered by mouse model. Differentially expressed gene lists were generated Gene Specific Analysis (GSA) for FVB-Driver models and Dseq2 for PyMT-F1 models. Differential gene expression was defined as a fold change with a false discovery rate less than 0.05. The Differential expression of genes within regions of CNV was defined as fold change with FDR less than 0.1. Volcano plots for Lung met v PT gene expression were created using unfiltered GSA results for all models. Gene ontology analysis was performed using differentially expressed gene lists and the Partek Flow Gene Set Enrichment tool. Pathways enriched with a p value less than 0.05 were considered significant. Gene ontology categories were assigned manually to the top 70 pathways or if less than 70 pathways were significantly enriched then all pathways with p < 0.05 were included. Heatmaps were created using the Excel conditional formatting tool, where color intensities were determined by the percent of the enriched terms found within each category out of the total terms categorized. Differential splicing analysis was performed using Partek flow. Following alignment to mm9 genome assembly, transcript counts were normalized and then each model was analyzed using the Partek tool to detect all splicing and differential splicing in Lung mets v PT. Alternative splicing was defined as both transcript differential expression with p value less than 0.05 and an alt-splicing p value less than 0.05. Using these p value cuts-offs, alternatively spliced gene lists were analyzed by Partek Gene Set Enrichment to identify pathways enriched by lung met alternatively spliced genes. Significantly enriched pathways were defined as those with p values less than 0.05. Statistics and reproducibility Statistical analysis was performed using GraphPad Prizm software, using t tests or ANOVA analysis with correction for multiple testing, as required. Sample numbers can be found in Supplementary Table [249]1. Reporting summary Further information on research design is available in the [250]Nature Portfolio Reporting Summary linked to this article. Supplementary information [251]Supplementary Figs.^ (17.3MB, pdf) [252]Supplementary data 1–18^ (10.7MB, zip) [253]42003_2025_8624_MOESM3_ESM.docx^ (20.7KB, docx) Description of Additional Supplementary Files [254]Reporting Summary^ (1.4MB, pdf) Acknowledgements