Abstract

A pangenome is the sum of the genetic information of all individuals in a species or a population. Genomics research has gradually shifted to a paradigm that uses a pangenome as the reference. However, in disease genomics, pangenome-based analysis is still in its infancy. In this study, we introduce GGCPan, a graph-based pangenome built from 185 patients with gastric cancer. We then systematically compared the cancer genomics results obtained using GGCPan, the linear pangenome GCPan, and the human reference genome as the reference. For small variant detection and microsatellite instability status identification, the choice among the three genomes made little difference. Using GGCPan as the reference had a significant advantage in structural variant identification. A total of 24 candidate gastric cancer driver genes were detected using the three reference genomes, of which eight were common to all three and five were detected only with the pangenomes. Our results show that a disease-specific pangenome is a promising reference and that a whole set of tools still needs to be developed or improved for disease genomics in the pangenome era.

Introduction

The availability of reference genomes has been the foundation of genomics research for the past decade. However, as reports of non-reference sequences and genes continue to increase across a wide range of species, it is becoming increasingly clear that a single genome is insufficient to represent the entire landscape of sequence diversity within a species (Tao et al, 2020). The human reference genome, for example, is currently structured as a linear composite of haplotypes from more than 20 individuals, with 70% of the sequence coming from a single individual. Its framework is biased and contains errors, and it is not representative of global human genome variation (Wang et al, 2022).
For example, identification of structural variants (usually longer than 50 bp) relies on detecting patterns of discordant read pairs or split read alignments, which in turn depends on the accuracy of read mapping. If reads are too short to span long repetitive regions of the genome, then assembling and detecting these structural variants is difficult. The limitations of short reads and the bias of the reference genome mean that we may be missing more than 70% of the structural variation in traditional whole-genome sequencing studies (Wang et al, 2022).

Pangenomes arose in response to these shortcomings of traditional reference genomes. The concept of the pangenome was introduced in 2005 and has been widely applied in bacteria, fungi, plants, and animals (Tettelin et al, 2005; Li et al, 2010; Li et al, 2014; Wang et al, 2018; Li et al, 2019; Sherman et al, 2019; Tian et al, 2020). As the cost of sequencing decreases, the human pangenome is also improving and developing. New sequences ranging from 0.3 to 296 Mb in size have been discovered in different populations (Sherman & Salzberg, 2020).

Current pangenomes fall into two broad categories: linear pangenomes and graph pangenomes. A linear pangenome contains a traditional reference genome plus extra non-reference sequences. Most linear pangenomes do not provide the location of the non-reference sequences, so most aligners simply treat them as additional sequences tacked onto the genome. In addition, the non-reference sequences are obtained by selecting representative sequences, which loses some or even most of the information unique to individuals. A graph pangenome, by contrast, embeds the new sequences in the reference genome and represents the sequences that differ among individual genomes as new nodes, which keeps both the positional information of the new sequences and the information of each individual.
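The contrast between the two representations can be made concrete with a toy example. The following is a minimal sketch, not the data model used by GGCPan or by any particular graph toolkit: the class `SequenceGraph`, its methods, and the short sequences are all hypothetical and chosen only for illustration. It shows that a graph keeps the anchor position of an inserted sequence, whereas a linear pangenome stores the same sequence as an unplaced extra contig.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Toy sequence graph: nodes carry sequence, paths are walks through nodes."""
    nodes: dict = field(default_factory=dict)   # node id -> sequence
    edges: set = field(default_factory=set)     # (from id, to id)
    paths: dict = field(default_factory=dict)   # path name -> list of node ids

    def add_path(self, name, node_ids):
        # Register a walk and the edges it traverses.
        for a, b in zip(node_ids, node_ids[1:]):
            self.edges.add((a, b))
        self.paths[name] = list(node_ids)

    def spell(self, name):
        # Concatenate node sequences along a path to recover the haplotype.
        return "".join(self.nodes[n] for n in self.paths[name])

    def insertion_offset(self, node_id, ref="ref"):
        # Reference offset (0-based, between bases) where a non-reference
        # node branches off the reference path; None if it is not attached.
        offset = 0
        for ref_node in self.paths[ref]:
            offset += len(self.nodes[ref_node])
            if (ref_node, node_id) in self.edges:
                return offset
        return None


# Graph representation: node 2 is a sequence absent from the reference.
graph = SequenceGraph(nodes={1: "ACGT", 2: "TTAGG", 3: "CCA"})
graph.add_path("ref", [1, 3])           # reference haplotype
graph.add_path("sample", [1, 2, 3])     # individual carrying the insertion

# Linear representation: the same sequence is appended as an unplaced contig,
# so its position relative to the reference is not recorded.
linear_pangenome = {"chr_toy": "ACGTCCA", "unplaced_1": "TTAGG"}

print(graph.spell("ref"))               # reference sequence
print(graph.spell("sample"))            # sequence of the individual
print(graph.insertion_offset(2))        # where the new sequence is anchored
```

In the linear dictionary nothing links `unplaced_1` to `chr_toy`, which mirrors why an aligner can only treat such sequences as additional contigs.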
One study demonstrated that graph-based mapping yields higher accuracy than linear alignment on reads that contain known variants, whereas linear genome alignment is superior when the reads do not contain variants (Grytten et al, 2020). At present, the graph pangenome has many applications in plant and animal genomics, including studies of humans, cattle, tomatoes, and cucumbers (Hadi et al, 2020; Li et al, 2022; Talenti et al, 2022; Zhou et al, 2022; Liao et al, 2023; Shi et al, 2023).

The study of the human pangenome in the medical field is still in its infancy. One example of the application of the pangenome to oncology is a previously published study on gastric tumors, which constructed a linear gastric cancer-specific pangenome called GCPan using whole-genome sequencing data from 185 Chinese gastric tumor patients (Yu et al, 2022). In this study, we built a graph pangenome construction pipeline and used it to construct a Chinese gastric cancer graph-based pangenome called GGCPan. We then performed variant detection in the 185 gastric cancer patients using the two Chinese gastric cancer pangenomes (GCPan and GGCPan) and the reference genome GRCh38. We aim to quantify the effect of the choice of reference genome on disease data: how the results resemble and differ from those of the traditional reference genome-based tumor analysis process, and what the advantages and disadvantages of each reference are. We also aim to find gastric cancer driver genes that exist only in the Chinese pangenome, to improve gastric tumor diagnosis and treatment, and more broadly to promote the development of precision medicine.

Results

Construction of GGCPan

We aligned the assembled genome sequences (contigs of 500 bp or more) of the tumor and normal tissues of the 185 patients to GRCh38. We then extracted the contigs that aligned to unique positions on the reference genome.
Based on the alignment results, we detected 3,632–4,682 structural variants (longer than 50 bp) in each sample (Fig S1B). After merging, a total of 39,605 structural variants were detected in the 185 samples. Finally, these variants were embedded into GRCh38 to construct the graph pangenome of the gastric cancer samples (see the Materials and Methods section, Fig S1A). We named this gastric cancer graph pangenome GGCPan.

Figure S1. Construction of GGCPan and comparison of mapping rate using three different reference genomes. (A) Construction pipeline of the gastric cancer graph pangenome GGCPan. (B) Histogram of the distribution of the number of SVs detected in the 185 samples. The SVs were detected by paftools.js based on minimap2 alignment results and were used to construct GGCPan. (C) Pipeline for aligning the non-reference sequences of GCPan to GGCPan. (D) Read mapping rates of 185 gastric tumor samples on three reference genomes.

Read alignment rate comparison using three reference genomes

We aligned the cancer and paracancer sequencing reads of the 185 patients to each of the three reference genomes (GRCh38, GCPan, and GGCPan) and detected SNPs, indels, and SVs from each set of alignments (see the Materials and Methods section). A downstream comparative analysis was then performed. We first compared the mapping rates of reads aligned to the different genomes for the 185 patients. Both pangenomes significantly improved the overall read mapping rate compared with GRCh38 (Fig S1D). In particular, the mapping rate of paired-end reads was higher with GGCPan than with GCPan. We believe this is because GGCPan anchors the location of each novel sequence, which avoids soft clipping and gaps in the reads during alignment and ensures that both reads of a pair are aligned at the same position.
Although GCPan includes sequences that are not contained in GRCh38, their chromosomal positions are unknown. In addition, some non-GRCh38 sequences are highly repetitive. As a result, a certain percentage of paired reads align to different positions, which lowers the mapping quality and affects variant detection.

There are 35,488 non-reference sequences in GCPan. To compare the non-reference sequences in GCPan and GGCPan, we aligned these 35,488 sequences to GGCPan (Fig S1C) and retained alignments with a mapping quality greater than 30. Overall, 60% (21,318) of the new sequences could be aligned to GGCPan at a sequence identity threshold of 90%, and 88% (31,254) could be aligned at a threshold of 80%. This suggests that GGCPan contains at least 60% of the new sequences detected by GCPan.

GGCPan has advantages in the detection of structural variants

We next compared the performance of structural variant detection using the three reference genomes. We first evaluated three structural variant detection tools, Manta (Chen et al, 2016), Delly (Rausch et al, 2012), and SVaBa (Wala et al, 2018) (Supplemental Data 1, Fig S2A, C, and D, Table S1) (Rausch et al, 2012; Chen et al, 2016; Wala et al, 2018; Kosugi et al, 2019; Hickey et al, 2020, 2024; Zook et al, 2020). All three tools were designed for linear genomes and performed well in a previous evaluation (Kosugi et al, 2019). We then chose the best-performing tool to detect structural variants on the linear genomes.

We randomly selected five of the 185 samples and simulated five whole-genome sequencing samples that contain the structural variants found in these five genomes. The simulated reads are paired-end, with a sequencing depth of 30× and a read length of 150 bp. We named this simulated dataset SimuA.
We aligned the reads of SimuA to the three reference genomes and then detected structural variants (see the Materials and Methods section). We calculated the mean precision, recall, and F1 values of the five samples (Fig 1A; see the Materials and Methods section). The precisions for GRCh38, GCPan, and GGCPan are 95.30%, 96.34%, and 91.71%, respectively. These values do not differ much, although the precision of GGCPan is slightly lower than that of the two linear genomes. The recalls for GRCh38, GCPan, and GGCPan are 71.28%, 61.02%, and 82.70%, respectively. GGCPan therefore captures the largest number of true SVs, with a recall roughly 11 and 22 percentage points higher than those of GRCh38 and GCPan.

The recall of GCPan-based SV identification is about 10 percentage points lower than that of GRCh38-based identification. Almost all (99%) of the SVs detected using GCPan are included in the SVs detected using GRCh38, and 15% more SVs were detected using GRCh38 as the reference than using GCPan (Fig S3A). This might be because some reads aligned to the non-reference sequences of GCPan, resulting in fewer SVs being called in the GRCh38 region. Two examples, an insertion and a deletion, illustrate this. For the insertion detected with GRCh38 but not with GCPan, more reads were aligned to the locus when GRCh38 was the reference, whereas the same reads were aligned to the non-reference sequences when GCPan was the reference (Fig S3B). The deletion shows a similar pattern (Fig S3C). These two examples show that the non-reference sequences are insertions relative to GRCh38, but traditional variant detection tools cannot report the non-reference sequences carried by each sample as variants.

These evaluation results show that GGCPan balances accuracy and completeness when detecting structural variants from short reads, whereas the linear genomes miss many true positives, a problem that is also prevalent with other tools (Kosugi et al, 2019).
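The metrics used in this comparison can be written out explicitly. The sketch below is illustrative only: the helper names `precision_recall_f1` and `f1_from` are hypothetical, and matching predicted SVs to the truth set (the step that yields the true positive, false positive, and false negative counts) is not shown. It applies the standard definitions to the mean precision and recall quoted above. Because the study averages each metric over five samples, an F1 value computed from mean precision and mean recall only approximates the mean per-sample F1 and need not equal the values plotted in Fig 1A.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions from counts of true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Mean precision and recall on SimuA as quoted in the text (as fractions).
reported = {
    "GRCh38": (0.9530, 0.7128),
    "GCPan": (0.9634, 0.6102),
    "GGCPan": (0.9171, 0.8270),
}

# Approximate F1 derived from the mean values (an assumption, see lead-in).
f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}

for name, score in f1_scores.items():
    print(f"{name}: F1 ~ {score:.4f}")

best = max(f1_scores, key=f1_scores.get)
print("Highest approximate F1:", best)
```

Under these definitions, the lower precision of GGCPan is outweighed by its higher recall, which is the trade-off the paragraph above describes.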
We also performed an evaluation using real data from GIAB and reached a similar conclusion (Supplemental Data 1) (Hickey et al, 2020, 2024; Zook et al, 2020).

Figure S2. Performance comparison of structural variant detection using three different reference genomes. (A) Comparison of the performance of different tools for structural variant detection using GRCh38 as the reference, on simulated data with different sequencing depths. (B) Effect of the completeness of the graph-modeled pangenome on its performance in detecting structural variants. The x-axis represents the number of samples used to construct the graph-modeled pangenome. The five samples used for evaluation were excluded from the samples used to construct the five graph pangenomes. (C) Flowchart of the evaluation of different reference genomes and variant identification tools using sequencing data from the GIAB HG002 sample. Different colors represent different identification pipelines. (D) Performance evaluation results for variant identification using different reference genomes.

Figure 1. Performance of structural variant detection using three different reference genomes. (A) Comparison of the performance of structural variant detection using three different reference genomes in simulated data. (B) Number of somatic structural variants detected using three reference genomes in real sequencing data from 185 patients. (C) Comparison of SVs detected using GRCh38 and GGCPan in 185 patients. (D) Comparison of SVs detected using GCPan and GGCPan in 185 patients. In (C, D), “+” stands for presence and “−” for absence. (E) Enriched pathways for SV-related genes. The SVs were detected using GGCPan in 185 samples. The size of each dot represents the number of related genes included in the pathway.

Figure S3. Comparison of structural variants in simulated data (SimuA) using GCPan and GRCh38 as the references.