Abstract Breast cancer imposes a significant burden globally. While the survival rate is steadily improving, much remains to be elucidated. This observational, single time point, multiomic study utilizing genomics, proteomics, targeted and untargeted metabolomics, and metagenomics in a breast cancer survivor (BCS) and age-matched healthy control cohort (N = 100) provides deep molecular phenotyping of breast cancer survivors. In this study, the BCS cohort had significantly higher polygenic risk scores for breast cancer than the control group. Carnitine and hexanoyl carnitine were significantly different. Several bile acid and fatty acid metabolites were significantly dissimilar, most notably the Omega-3 Index (O3I) (significantly lower in BCS). Proteomic and metagenomic analyses identified group and pathway differences, which warrant further investigation. The database built from this study contributes a wealth of data on breast cancer survivorship where there has been a paucity, affording the ability to identify patterns and novel insights that can drive new hypotheses and inform future research. Expansion of this database in the treatment-naïve, newly diagnosed, controlling for treatment confounders, and through the disease progression, can be leveraged to profile and contextualize breast cancer and breast cancer survivorship, potentially leading to the development of new strategies to combat this disease and improve the quality of life for its victims. Keywords: breast cancer survivors, breast cancer, multiomics, genomics, metagenomics, metabolomics, proteomics, microbiome, omega-3 fatty acids, aptamer 1. Introduction Breast cancer (BrCa) remains a formidable global health challenge, affecting more than two million women and their families each year and leading to almost 700,000 deaths globally [[52]1]. In 2020, female breast cancer had the highest incidence rate of all cancers in both sexes and in female-specific cancers, accounting for 11.7% of all cancers and 24.5% of female cancers [[53]1]. In the United States, breast cancer accounted for an estimated 31% of new female cancer cases and an estimated 15% of female cancer-related deaths in 2022 [[54]2]. As such, women with breast cancer represent the largest sub-population of those with cancer in the United States and worldwide. There exists a body of literature on molecular analytics and therapeutics in breast cancer [[55]3,[56]4,[57]5,[58]6,[59]7,[60]8,[61]9,[62]10,[63]11,[64]12,[65]13, [66]14,[67]15,[68]16,[69]17], including an emergent body of literature with multiomics applied to biopsy specimens in breast cancer [[70]4,[71]6,[72]9,[73]10,[74]11,[75]14,[76]15,[77]17]. However, there is a paucity of research in which multiomic analyses have been applied to breast cancer survivors (BCS) who have completed treatment. The present study was established as part of a two-step investigation to address this need. The first step, reported herein, was designed to apply a comprehensive multiomic analysis (using untargeted genomics, proteomics (serum), metabolomics (plasma, urine, stool), gut metagenomics, and polygenic risk scores) to a cohort of BCS derived from the Mayo Clinic Breast Cancer Registry. This includes the most detailed NextGen sequencing of the gut metagenome in BCS to date. The primary aim was to develop a comprehensive molecular atlas of a single time point comparing a BCS cohort with an age-matched control group of individuals with no history of breast cancer. The resultant molecular atlas would form the foundation upon which future multiomics investigations into a larger BCS cohort might be based. The second step was envisioned to refine the paradigm and apply those lessons to a multiomic study of those newly diagnosed with breast cancer and prior to treatment initiation. An acknowledged limitation of the present study is the confounding effect of treatment on the molecular dynamics in BCS, wherein it is not possible to draw conclusions about how the molecular findings may or may not have contributed to the evolution of the disease. However, the present study does lend insight into whether the molecular phenotypes in the BCS vs. the control cohorts were convergent or divergent. In general, the application of the methods reported in this paper is expected to, in the future, provide the ability to describe the molecular dynamics in those newly diagnosed with breast cancer in greater detail than has been reported previously. From this study, and future studies building upon this platform, new insights and hypotheses can emerge which strengthen, deepen, and potentially improve the way breast cancer is approached, before, during, and after diagnosis. 2. Materials and Methods 2.1. Study Design, Recruitment, Inclusion/Exclusion Criteria, and Informed Consent In this study, we performed advanced multiscale omics analysis coupled with gut microbiome metagenome and gut microbiome metabolome data analyses in Breast Cancer Survivors (BCS). This included univariate, multivariate, and pathway analysis applied to large datasets developed to detect unique patterns of variance in targeted serum metabolites, plasma metabolites, gut microbiome community structure, gut microbiome metabolome, urine metabolome, and quality of life measures. These measures have aided in the development of a database that can be used as a reference population for precision medicine in BCS. Cases and controls were recruited to the study via mail or in person. Each study packet included an informed consent document, an invitation letter, and a return envelope. For BCS cases, eligible participants were adult females enrolled in the Mayo Clinic Biospecimen Resource for Breast Disease (IRB # 1815-04), current ages 18–75 years, with a prior diagnosis of stage 0–3 (in situ or invasive) breast cancer (BCS) who had completed active therapy (surgery, radiation, and/or chemotherapy) approximately 8–30 months prior to provision of their samples. For trial controls, age-matched (within 0–5 years depending upon the convenience and availability) females with no history of cancer (other than non-melanoma skin cancer) were selected from the PRISM Study (IRB #18-002366 The Predicting Risk after Screening Mammogram (PRISM) Study) or the Mayo Clinic Breast Mammography practice. Exclusion criteria for both cohorts (BCS cases and trial controls) exclusion criteria included: male biological sex or gender, pregnant females, or participants unwilling to travel to Mayo Clinic Rochester for the study visit or unwilling to provide mail-in stool samples. Participants came to the Mayo Clinic Clinical Research and Trials Unit (Mayo Clinic CRTU) in Rochester, MN, USA. At the CRTU, participants were asked to sign the consent form, their height and weights were then obtained, questionnaires were completed, and the participants’ samples were obtained. Some data was gathered via electronic health record (EHR) or was from other studies in which these individuals had participated (Mayo Clinic Biospecimen Resource for Breast Disease (IRB # 1815-04) or IRB #18-002366: The Predicting Risk after Screening Mammogram (PRISM Study)). Both avenues of additional data collection were consented to by participants prior to enrollment in this study. Descriptive statistics (mean; median; SD; IQR) were obtained for BCS and HC for parametric variables. For PHQ-8 and GAD-7, the mean and standard deviation were analyzed. For significance testing of parametric variables, Student’s t-tests were conducted. For non-parametric statistics, the Wilcoxon Rank-Sum Test was performed. All statistical analysis for metadata was performed using R version 4.2.3. 2.2. Whole Genome Sequencing The WGS protocol used in this experiment was similar to the protocol previously described by Meydan et al. and will be briefly summarized here [[78]18]. Stool samples were collected using Thorne’s Gut Health Test, which provides metagenomic sequencing of the microbiome. To isolate microbial DNA from the samples, an automated protocol, and MoBio’s PowerMag^® (+CleaMag^®) microbiome DNA isolation kit were utilized on the KingFischer^TM Flex Instrument. The concentration of extracted DNA from each sample and estimated sample purity were determined by Qubit measurement and spectrophotometry (A260/280 and A260/230 absorbance ratios), respectively. Nextera XT Library Prep (Illumina, San Diego, CA, USA) was used to enzymatically fragment and tag primer sites for adapter index addition, creating next-generation sequencing libraries. Sequencing adapters and indices were added during polymerase chain reaction (PCR) amplification, followed by library verification by fragment analysis (Agilent Bioanalyzer, Agilent, Santa Clara, CA, USA). DNA sequencing was performed utilizing the Illumina NextSeq platform, which generated 350 M reads per sample (150 × 150 read length) and translated to 24× coverage. Whole genome sequences were aligned to the hg38 reference genome using BWA mem with the default parameters [[79]19]. Sentieon LocusCollector and Dedup algorithms were used to remove duplicate reads [[80]20]. After indel realignment and base recalibration using Sentieon default parameters, Somatic SNVs, and SVs were called with the DNAscope algorithm [[81]21]. Known variants were annotated using dbSNP SNVs, Indels, and Mills, and 1000 genomes indels [[82]22,[83]23]. After variant calling, variants were filtered by QC, including removing any sites with p > 0.25 or marked as low quality. Polygenic risk score (PRS) matrices were collected from PGS Catalog [[84]24] for signatures of breast cancer-specific scores and other breast cancer subtypes, as well as genetic signatures for 54 health and disease scores [[85]25]. PRS for each individual was calculated by scoring the variants of the subject using the PGS Catalog calculator (PGS Catalog Calculator (in preparation [0]). PGS Catalog Team. PGScatalog/pgsc_calc). Statistical comparison of healthy controls versus breast cancer survivors was performed on these scores using the Wilcoxon rank-sum test. 2.3. Proteomics via Aptamer Proteomic profiling was conducted at SomaLogic Operating Co., Inc. (Boulder, CO, USA), using the SomaScan Assay. The SomaScan method has been previously described in detail by Gold and colleagues and Kim et al. [[86]26,[87]27]. Briefly, SOMAmer (Slow Off-rate Modified Aptamers) reagents consist of short single-stranded DNA sequences with a high binding affinity for proteins. The diverse chemical properties allow for a greatly expanded library from which SOMAmer reagents are selected. SOMAmer reagents are discovered in vitro using the SELEX (Systemic Evolution of Ligands by EXponential enrichment) process, which incorporates chemically modified nucleotides with structural similarity to desired amino acid side chains. This enhances the specificity and affinity of the target protein binding interaction. Initially, samples were diluted and incubated with SOMAmer reagent mixes attached to streptavidin (SA)-coated beads. The beads had been washed and tagged with an NHS-biotin reagent. SOMAmer complexes and unbound SOMAmer reagents were subjected to ultraviolet light to cleave a “photo-cleavable” linker within the SOMAmer reagent which released them into a solution containing an anionic competitor. The anionic competitor displaced the SOMAmer reagents in the eluent, which was then incubated with a second SA-coated bead. Subsequent washing removed the free SOMAmer regent, followed by a final elution step, in which the protein-bound SOMAmer reagent was released from the proteins when mixed with a denaturing agent. Quantification of SOMAmer reagents was achieved by hybridization to custom DNA microarrays, which were detected via their cyanine-3 signal. Principle component analysis was performed to reduce the dimensionality of a dataset by identifying and retaining the most significant features. The distance matrix from generalized unifrac metrics was fed into t-distributed stochastic neighbor embedding (t-SNE) to project the samples into a two-dimensional space [[88]28]. Log-normal univariate (Model A) and multivariate modeling (Model B and Model C) were performed. Model A was an uncorrected comparison for BCS versus healthy cohort. Model B corrected for age and menopause status and Model C corrected for age, menopause, and treatment type. Pathway enrichment analysis was performed, and the top 50 pathways were presented for each model. 2.4. Targeted Metabolomics 2.4.1. Acylcarnitines The following acylcarnitines were analyzed in human plasma samples: carnitine, 2-methyl butyryl carnitine, 3-hydroxy butyryl carnitine, acetylcarnitine, butyryl carnitine, decanoyl carnitine, hexanoyl carnitine, isobutyryl carnitine, isovaleryl carnitine, lauroyl carnitine, linoleoyl carnitine, myristoyl carnitine, octanoyl carnitine, oleoyl carnitine, palmitoyl carnitine, propionyl carnitine, stearoyl carnitine, and valeryl carnitine. Samples were analyzed via LC-MS/MS by Metabolon, Inc. (Morrisville, NC, USA). The plasma samples were spiked with stable, labeled internal standards, homogenized, and subjected to protein precipitation with organic solvents. Following centrifugation, an aliquot of the supernatant was injected into an Agilent 1290/AB Sciex QTrap 5500 LC-MS/MS system equipped with a C18 reversed-phase UHPLC column. The mass spectrometer was operated in positive mode using electrospray ionization (ESI). The peak of the individual analyte product ions was measured against the peak of the corresponding internal standards. Quantitation was performed using a weighted linear least squares regression analysis generated from fortified calibration standards and prepared immediately prior to each run. LC-MS/MS raw data were collected and processed using AB SCIEX software Analyst 1.6.3 and SCIEX OS-MQ software v1.7. Data reduction was performed using Microsoft Excel for Office 365 v.16. 2.4.2. Bile Acids A Human Plasma Bile Acid Panel was performed, which measures all the primary and secondary bile acids and their conjugates. Bile acid concentrations were analyzed by LC-MS/MS (Metabolon Method TAM178: “LC-MS/MS Method for the Quantitation of Bile Acids”). Bile acids measured included: cholic acid (CA), chenodeoxycholic acid (CDCA), deoxycholic acid (DCA), lithocholic acid (LCA), ursodeoxycholic acid (UDCA), glycocholic acid (GCA), glycochenodeoxycholic acid (GCDCA), glycodeoxycholic acid (GDCA), glycoursodeoxycholic acid (GUDCA), taurocholic acid (TCA), taurochenodeoxycholic acid (TCDCA), taurodeoxycholic acid (TDCA), taurolithocholic acid (TLCA), tauroursodeoxycholic acid (TUDCA), and lycolithocholic acid (GLCA). Calibration samples were prepared at eight different concentration levels by spiking a phosphate-buffered saline/bovine serum albumin (PBS/BSA) solution with a corresponding calibration spiking solution. Calibration samples, study samples, and quality control samples were spiked with a solution of isotopically labeled internal standards and subjected to protein separation with acidified methanol (organic solvent). Following centrifugation, an aliquot of supernatant was evaporated and dried in a stream of nitrogen. The dried extracts were reconstituted and injected on an Agilent 1290 Infinity/SCIEX QTRAP 6500 LC-MS/MS system equipped with a C18 reverse phase UHPLC column. The mass spectrometer was operated in negative mode using electrospray ionization (ESI). The peak area of each bile acid parent (pseudo-MRM mode) or the product ion was measured against the peak area of the respective internal standard parent (pseudo-MRM mode) or the product ion. Quantitation was performed using a weighted linear least squares regression analysis generated from fortified calibration standards prepared immediately prior to each run. LC-MS/MS raw data were collected using SCIEX software Analyst 1.7.3 and processed using SCIEX OS-MQ v.1.7. Data reduction was performed using Microsoft Excel for Office 365 v.16. 2.4.3. Fatty Acid Panel In order to assess red blood cell (RBC) fatty acids, dried blood spot (DBS) samples were collected according to manufacturer instructions (OmegaQuant Analytics, LLC.; Sioux Falls, SD, USA). Samples were processed and analyzed according to OmegaQuant Omega-3 Index^® methodology, as follows. A drop of blood was collected on filter paper that was pre-treated with a cocktail solution (Fatty Acid Preservative Solution, FAPS™) and allowed to dry at room temperature for 15 min. Following drying, samples were kept at 2–8 °C. The DBS was shipped to OmegaQuant for fatty acid analysis. One punch of the DBS was transferred to a screw-cap glass vial followed by the addition of BTM (methanol containing 14% boron trifluoride, toluene, methanol; 35:30:35 v/v/v; Sigma-Aldrich, St. Louis, MO, USA). The vial was briefly vortexed and heated in a hot bath at 100 °C for 45 min. After cooling, hexane (EMD Chemicals, Gibbstown, NJ, USA) and HPLC-grade water were added. Subsequently, the tubes were recapped, vortexed, and centrifuged to help separate layers. An aliquot of the hexane layer was transferred to a gas chromatography (GC) vial. GC was carried out using a GC-2010 Gas Chromatograph (Shimadzu Corporation, Columbia, MD, USA) equipped with an SP-2560, 100-m fused silica capillary column (0.25 mm internal diameter, 0.2 um film thickness; Supelco, Bellefonte, PA, USA). Fatty acids were identified by comparison with a standard mixture of fatty acids characteristic of RBC (GLC OQ-A, NuCheck Prep, Elysian, MN, USA) which was also used to construct individual fatty acid calibration curves. The following 24 fatty acids (by class) were identified: saturated (14:0, 16:0, 18:0, 20:0, 22:0 24:0); cis monounsaturated (16:1, 18:1, 20:1, 24:1); trans (16:1, 18:1*, 18:2*—see below for more details); cis n-6 polyunsaturated (18:2, 18:3, 20:2, 20:3, 20:4, 22:4, 22:5); cis n-3 polyunsaturated (18:3, 20:5, 22:5, 22:6). Fatty acid composition was expressed as a percent of total identified fatty acids. The Omega-3 Index is defined as the sum of 20:5n-3 (EPA) and 22:6n-3 (DHA) adjusted by a regression equation (r = 0.97) to predict the Omega-3 Index in the RBC. * The chromatographic conditions used in this study were sufficient to isolate the C16:1trans isomers and the C18:2 D 9t-12c, 9t-12t, and 9c-12t isomers; the latter is reported as C18:2n6t. However, each individual C18:1 trans molecular species (i.e., C18:1 D6 through D13) could not be separated but appeared as two blended peaks that eluted just before oleic acid. The areas of these two peaks were summed and referred to as a C18:1 trans. 2.4.4. Targeted Metabolomics Statistical Methods Due to non-normal distributions, acylcarnitine, bile acid, and RBC fatty acid data were subjected to Mann Whitney U Tests (non-parametric, between groups) and Kruskal-Wallis ANOVA (non-parametric, select findings) (OriginPro 2024 (64-bit) SR1 ver 10.1.0.178). Data were Log[10] normalized and presented as split violin plots to enhance visualization. 2.5. Untargeted Metabolomics Untargeted metabolomics analyses were conducted on stool, plasma, and urine samples. Stool samples were homogenized by sonification after adding 1 × PBS (10 mL per mg tissue) prior to prepping. Plasma, urine, and stool homogenates were deproteinized with 6× volume of cold acetonitrile:methanol (1:1) solution following the addition of 13C6-phenylalanine (3 µL at 250 ng/)µL as internal standard. Samples were kept on ice with intermittent vortexing and centrifuged at 18,000× g for 30 min at 4 °C. The supernatants were divided into 2 aliquots and dried down for analysis on a Quadruple Time-of-Flight Mass Spectrometer (Agilent Technologies 6550 Q-TOF, San Diego, CA, USA) coupled with an Ultra High-Pressure Liquid Chromatograph (1290 Infinity UHPLC Agilent Technologies, San Diego, CA, USA). Profiling data were acquired under both positive and negative electrospray ionization conditions over a mass range of 100–1200 m/z at a resolution of 10,000–35,000 (separate runs). Metabolite separation was achieved using two columns of differing polarity, a hydrophilic interaction column (HILIC, ethylene-bridged hybrid 2.1 × 150 mm, 1.7 mm; Waters) and a reversed-phase C18 column (high-strength silica 2.1 × 150 mm, 1.8 mm; Waters). For each column, the run time is 20 min using a flow rate of 400/µL min. A total of four runs per sample was performed to give maximum coverage of metabolites. A quality control sample, made up of a subset of samples from the study, was injected several times during each run. All raw data files obtained were converted to compound exchange file format using Masshunter DA reprocessor software (Agilent, San Diego, CA, USA). Mass Profiler Professional (Agilent, San Diego, CA, USA) was used for data alignment and to convert each metabolite feature (m/z × intensity × time) into a matrix of detected peaks for compound identification. Putative Identifications (IDs) were given to components based on mass molecular weights and further examined by comparison to a reference standard of the proposed compound. The mass accuracy of the Q-TOF method was <5 ppm with retention time precision better than 0.2%. A 1.2× fold change can be detected with a precision of 4%. 13C6 phenylalanine internal standard was used to check for recovery of each sample only and not in normalization of the data. An unsupervised principal component analysis, ANOVA, 3D plot and heat map, and a Partial Least Square discrimination analysis (PLS-DA) comparison between groups were generated for analysis. The R-package MetaboanalystR software (version 3.3.0) was used for data normalization, differential expression analysis, and visualization. In each mode, metabolites were row-wise normalized to a constant sum (SumNorm), log-transformed, and then scaled by mean centering (MeanCenter). Normalized data were analyzed by multivariate Principal Component Analysis (PCA) to reveal data heterogeneity, groupings, outliers, and trends. Hierarchical clustering analysis (HCA) was performed to reveal clustering between sample injections, group replicates, and multiple metabolite clusters that correlate with different subsets of clinical variables. The univariate Student’s unpaired t-test was conducted for between-group analysis with multiple testing corrections to identify differentially expressed metabolites between two groups with statistical significance (FDR adjusted p-value ≤ 0.05 and |fold change| ≥ 1.5). 2.6. Gut Metagenomics Gut metagenomics analyses were performed at CosmosID (Germantown, MD, USA). DNA from stool samples was isolated using the QIAGEN DNeasy PowerSoil Pro Kit, according to the manufacturer’s protocol. DNA samples were quantified using the GloMax Plate Reader System (Promega, Madison, WI, USA) using the QuantiFluor^® dsDNA System (Promega, Madison, WI, USA) chemistry. DNA libraries were prepared using the Nextera XT DNA Library Preparation Kit (Illumina, San Diego, CA, USA) and IDT Unique Dual Indexes with a total DNA input of 1 ng. Genomic DNA was fragmented using a proportional amount of Illumina Nextera XT fragmentation enzyme. Unique dual indexes were added to each sample followed by 12 cycles of PCR to construct libraries. DNA libraries were purified using AMpure magnetic beads (Beckman Coulter) and eluted in QIAGEN EB buffer. DNA libraries were quantified using Qubit 4 fluorometer and Qubit™ dsDNA HS Assay Kit. Libraries were then sequenced on an Illumina NextSeq 2000 platform 2 × 150b. Sequences were trimmed by Trimmomatic [[89]29], and then aligned to human genome reference using BWA [[90]19]. Taxonomic annotation was performed by utilizing KrakenUniq [[91]30] and subsequently Bracken [[92]31] on a database that includes all bacterial, archaeal, viral, and fungal references from RefSeq along with human references. The