Abstract

   A pangenome is the sum of the genetic information of all individuals in
   a species or a population. Genomics research has been gradually shifting
   to a paradigm that uses a pangenome as the reference. However, in
   disease genomics, pangenome-based analysis is still in its infancy. In
   this study, we introduce GGCPan, a graph-based pangenome built from 185
   patients with gastric cancer. We then systematically compared the
   results of cancer genomics analyses using GGCPan, the linear pangenome
   GCPan, and the human reference genome as the reference. For small
   variant detection and microsatellite instability status identification,
   the three references gave very similar results. For structural variant
   identification, using GGCPan as the reference had a significant
   advantage. A total of 24 candidate gastric cancer driver genes were
   detected using the three reference genomes, of which eight were common
   and five were detected only with the pangenomes. Our results show that
   a disease-specific pangenome is a promising reference and that a full
   set of tools still needs to be developed or improved for disease
   genomics in the pangenome era.

Introduction

   The availability of reference genomes has been the foundation of
   genomics research for the past decade. However, as reports of the
   non-reference sequences and genes continue to increase across a wide
   range of species, it is becoming increasingly clear that a single
   genome is insufficient to represent the entire landscape of sequence
   diversity within a species ([34]Tao et al, 2020). The human reference
   genome, for example, is currently structured as a linear composite of
   haplotypes from more than 20 individuals, with 70% of the sequence
   coming from a single individual. Its framework is biased, contains
   errors, and is not representative of global human genome variation
   ([35]Wang et al, 2022). Structural variant detection illustrates the
   problem. Identification of structural variants (usually defined as
   variants longer than 50 bp) relies on detecting patterns of discordant
   read pairs or split-read alignments, which in turn depends on the
   accuracy of read mapping. If reads are too short to span long
   repetitive regions of the genome, then assembling and detecting these
   structural variants is difficult. Because of the limitations of short
   reads and the bias of the reference genome, traditional whole-genome
   sequencing studies may miss more than 70% of structural variation
   ([36]Wang et al, 2022). Pangenomes were developed to address these
   shortcomings of a single reference genome.

   The concept of pangenomes was introduced in 2005 and has been widely
   used in bacteria, fungi, plants, and animals ([37]Tettelin et al, 2005;
   [38]Li et al, 2010; [39]Li et al, 2014; [40]Wang et al, 2018; [41]Li et
   al, 2019; [42]Sherman et al, 2019; [43]Tian et al, 2020). As the cost
   of sequencing decreases, human pangenomes are also being developed and
   improved. New sequences ranging from 0.3 to 296 Mb in size have been
   discovered in different populations ([44]Sherman & Salzberg, 2020).
   Current pangenomes fall into two broad categories: linear pangenomes
   and graph pangenomes. A linear pangenome contains a traditional
   reference genome plus extra non-reference sequences. Most linear
   pangenomes do not record the genomic location of the non-reference
   sequences, so most aligners simply treat them as additional sequences
   tacked onto the genome. In addition, the non-reference sequences are
   obtained by selecting representative sequences, which loses some or
   even most of the information unique to individuals. A graph pangenome,
   in contrast, embeds new sequences in the reference genome and
   represents sequences that differ among individual genomes as new
   nodes, which preserves both the position of each new sequence and the
   information of each individual. A study demonstrated that whereas
   graph-based mapping yields higher accuracy than linear alignment on
   reads that contain known variants, linear genome alignment is superior
   when the reads do not contain variants ([45]Grytten et al, 2020). At
   present, the graph pangenome has many applications in genomic studies
   of plants and animals, including humans, cattle, tomatoes, and
   cucumbers ([46]Hadi et al, 2020; [47]Li et al, 2022; [48]Talenti et al,
   2022; [49]Zhou et al, 2022; [50]Liao et al, 2023; [51]Shi et al, 2023).
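
   The difference between the two representations can be made concrete
   with a toy example. The sketch below is not the GGCPan data model; the
   class `SequenceGraph` and the helper `embed_insertion` are hypothetical
   names used only for illustration. It shows how a graph keeps an
   inserted sequence anchored at its position, so that both the reference
   haplotype and the insertion-carrying haplotype remain spellable paths.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Toy sequence graph: nodes hold sequence, edges connect node ids."""
    nodes: dict = field(default_factory=dict)  # node id -> sequence
    edges: set = field(default_factory=set)    # (from_id, to_id) pairs

    def add_node(self, node_id, seq):
        self.nodes[node_id] = seq

    def add_edge(self, src, dst):
        self.edges.add((src, dst))

    def spell(self, path):
        """Concatenate node sequences along a path, checking each step is an edge."""
        for src, dst in zip(path, path[1:]):
            if (src, dst) not in self.edges:
                raise ValueError(f"no edge {src}->{dst}")
        return "".join(self.nodes[n] for n in path)


def embed_insertion(reference, pos, inserted):
    """Split a linear reference at `pos` and embed `inserted` as an alternative node."""
    graph = SequenceGraph()
    graph.add_node("ref_left", reference[:pos])
    graph.add_node("ins", inserted)
    graph.add_node("ref_right", reference[pos:])
    graph.add_edge("ref_left", "ref_right")  # reference haplotype skips the insertion
    graph.add_edge("ref_left", "ins")        # alternative haplotype enters the insertion
    graph.add_edge("ins", "ref_right")       # and rejoins the reference afterwards
    return graph


graph = embed_insertion("ACGTACGT", 4, "TTT")
ref_hap = graph.spell(["ref_left", "ref_right"])
alt_hap = graph.spell(["ref_left", "ins", "ref_right"])
```

   A linear pangenome would instead store `TTT` as a separate record with
   no edges, which is why an aligner cannot tell where it belongs.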

   Currently, the study of the human pangenome in the medical field is
   still in its infancy. One example of its application in oncology is a
   previously published study on gastric tumors, which constructed a
   linear gastric cancer–specific pangenome called GCPan using
   whole-genome sequencing data from 185 Chinese gastric tumor patients
   ([52]Yu et al, 2022). In this study, we built a graph pangenome
   construction pipeline and used it to construct a Chinese gastric cancer
   graph-based pangenome called GGCPan. We then performed variant
   detection in the 185 gastric cancer patients using the two Chinese
   gastric cancer pangenomes (GCPan and GGCPan) and the reference genome
   GRCh38. We aim to quantify the effect of the choice of reference genome
   on disease data: how the results resemble or differ from those of the
   traditional reference genome-based tumor analysis process, and what
   the advantages and disadvantages of each reference are. We also aim to
   find gastric cancer driver genes that exist only in the Chinese
   pangenome, with the goal of improving gastric tumor diagnosis and
   treatment and promoting the development of precision medicine.

Results

Construction of GGCPan

   We aligned the assembled genome sequences (contigs of 500 bp or longer)
   of the tumor and normal tissues of 185 patients to GRCh38. We then
   extracted the contigs that aligned to unique positions on the reference
   genome. Based on the alignment results, we detected 3,632–4,682
   structural variants (variant length more than 50 bp) in each sample
   ([53]Fig S1B). After merging, a total of 39,605 structural variants
   were detected in the 185 samples. Finally, these variants were embedded
   into GRCh38 to construct the graph pangenome of gastric cancer samples
   (see the Materials and Methods section, [54]Fig S1A). We named this
   gastric cancer graph pangenome GGCPan.
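
   The merging step can be pictured with a small sketch. The criteria used
   to decide that two calls from different samples are the same variant
   are not stated in this section, so the thresholds `POS_TOLERANCE` and
   `LEN_RATIO` below, the `SV` record, and the greedy strategy are all
   assumptions made for illustration. Only the length cutoff of 50 bp
   comes from the text.

```python
from collections import namedtuple

# Hypothetical record for one structural variant call in one sample.
SV = namedtuple("SV", "chrom pos svtype length sample")

MIN_SV_LEN = 50     # from the text: variant length more than 50 bp
POS_TOLERANCE = 50  # assumption: breakpoints within 50 bp count as one site
LEN_RATIO = 0.8     # assumption: shorter/longer length must be at least 0.8


def same_event(a, b):
    """Return True if two calls plausibly describe the same variant."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > POS_TOLERANCE:
        return False
    return min(a.length, b.length) / max(a.length, b.length) >= LEN_RATIO


def merge_svs(calls):
    """Greedy merge: keep one representative per run of matching calls."""
    kept = [sv for sv in calls if sv.length > MIN_SV_LEN]
    # Sort so that calls of the same type on the same chromosome are adjacent.
    kept.sort(key=lambda sv: (sv.chrom, sv.svtype, sv.pos))
    merged = []
    for sv in kept:
        if merged and same_event(merged[-1], sv):
            continue  # already represented by the previous call
        merged.append(sv)
    return merged


calls = [
    SV("chr1", 1000, "INS", 300, "P001"),
    SV("chr1", 1010, "INS", 310, "P002"),  # same site and similar length
    SV("chr1", 1005, "DEL", 120, "P002"),  # same site, different type
    SV("chr2", 500, "DEL", 40, "P003"),    # below the length cutoff
    SV("chr1", 5000, "INS", 300, "P003"),
]
merged = merge_svs(calls)
```

   In this toy input the two nearby insertions collapse into one
   representative, which is the behavior that reduces per-sample call
   sets to a single non-redundant set before embedding into the graph.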

Figure S1. Construction of GGCPan and comparison of mapping rate using three
different reference genomes.

   (A) Construction pipeline of gastric cancer graph pangenome GGCPan. (B)
   Histogram of the distribution of the number of SVs detected in the 185
   samples. The SVs are detected by paftools.js based on minimap2
   alignment results and are applied to construct the GGCPan. (C) Pipeline
   of non-reference sequences of GCPan aligned to GGCPan. (D) Read mapping
   rates of 185 gastric tumor samples on three reference genomes.

Read alignment rate comparison using three reference genomes

   We aligned the tumor and paracancerous sequencing reads of 185 patients
   to each of three reference genomes (GRCh38, GCPan, and GGCPan) and
   detected SNPs, indels, and SVs from each set of alignments (see the
   Materials and Methods section). A downstream comparative analysis was
   then performed.

   We compared the mapping rates of reads aligned to the different genomes
   for the 185 patients. Both pangenomes significantly improved the
   overall read mapping rate compared with GRCh38 ([57]Fig S1D). In
   particular, the mapping rate of paired-end reads was higher with GGCPan
   than with GCPan. We believe this is because the novel sequences in
   GGCPan are anchored to genomic locations, which avoids soft clipping
   and gaps in the reads during alignment and ensures that paired-end
   reads are aligned at the same position. Although GCPan includes
   sequences that are not contained in GRCh38, their chromosomal positions
   are unknown. In addition, some non-GRCh38 sequences are highly
   repetitive. As a result, a certain percentage of paired reads align to
   different positions, which lowers mapping quality and affects variant
   detection.
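
   As an illustration of how such rates can be derived, the sketch below
   computes an overall mapping rate and a properly-paired rate from SAM
   FLAG values. The flag bits are those defined by the SAM format. The
   function name `mapping_stats`, the choice to count primary records
   only, and the use of the properly-paired bit as the paired-end measure
   are assumptions, since this section does not define the rates exactly.

```python
# SAM FLAG bits (defined by the SAM format specification).
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_stats(flags):
    """Compute mapping and properly-paired rates over primary records."""
    # Assumption: secondary and supplementary records are ignored so that
    # each read is counted once.
    primary = [f for f in flags if not (f & (SECONDARY | SUPPLEMENTARY))]
    total = len(primary)
    if total == 0:
        return {"mapping_rate": 0.0, "proper_pair_rate": 0.0}
    mapped = sum(1 for f in primary if not (f & UNMAPPED))
    proper = sum(
        1 for f in primary if (f & PROPER_PAIR) and not (f & UNMAPPED)
    )
    return {
        "mapping_rate": mapped / total,
        "proper_pair_rate": proper / total,
    }


# Toy input: one proper pair (99/147), one unmapped pair (77/141),
# one mapped but improperly paired pair (65/129), one supplementary record.
flags = [99, 147, 77, 141, 65, 129, 2147]
stats = mapping_stats(flags)
```

   Under this definition, a pair whose mates land on an unplaced
   non-reference sequence and on a chromosome would still count as mapped
   but not as properly paired, which matches the explanation given above
   for the GCPan results.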

   There are 35,488 non-reference sequences in GCPan. To compare the
   non-reference sequences in GCPan and GGCPan, we aligned these 35,488
   sequences to GGCPan ([58]Fig S1C) and retained alignments with a
   mapping quality greater than 30. Overall, 60% (21,318) of the new
   sequences could be aligned to GGCPan at a sequence identity threshold
   of 90%, and 88% (31,254) at a threshold of 80%. This suggests that
   GGCPan contains at least 60% of the new sequences detected by GCPan.
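
   The filtering and the reported percentages can be reproduced in outline
   as follows. The function `fraction_aligned` and the tuple layout
   (query name, mapping quality, identity) are hypothetical; only the
   mapping quality cutoff, the identity thresholds, and the counts come
   from the text.

```python
def fraction_aligned(alignments, total_queries, min_identity, min_mapq=30):
    """Fraction of query sequences with at least one alignment passing filters.

    `alignments` is an iterable of (query_name, mapq, identity) tuples,
    where identity is a fraction between 0 and 1.
    """
    hits = {
        name
        for name, mapq, identity in alignments
        if mapq > min_mapq and identity >= min_identity
    }
    return len(hits) / total_queries


# Toy input with five query sequences.
toy = [
    ("s1", 60, 0.95),
    ("s1", 40, 0.85),  # second alignment of the same query, counted once
    ("s2", 60, 0.85),
    ("s3", 20, 0.99),  # fails the mapping quality filter
    ("s4", 60, 0.70),  # fails both identity thresholds
]
frac_90 = fraction_aligned(toy, total_queries=5, min_identity=0.90)
frac_80 = fraction_aligned(toy, total_queries=5, min_identity=0.80)

# Arithmetic check of the counts reported in the text.
TOTAL_NONREF = 35_488
pct_at_90 = 100 * 21_318 / TOTAL_NONREF
pct_at_80 = 100 * 31_254 / TOTAL_NONREF
```

   Counting distinct query names rather than alignments matters here,
   because a repetitive non-reference sequence may align to several
   places in the graph.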

GGCPan has advantages in the detection of structural variants

   We compared the performance of structural variant detection using the
   three different reference genomes. We first evaluated three structural
   variant detection tools, Manta ([59]Chen et al, 2016), Delly
   ([60]Rausch et al, 2012), and SVaBa ([61]Wala et al, 2018)
   ([62]Supplemental Data 1, [63]Fig S2A, C, and D, Table S1) ([64]Rausch
   et al, 2012; [65]Chen et al, 2016; [66]Wala et al, 2018; [67]Kosugi et
   al, 2019; [68]Hickey et al, 2020, [69]2024; [70]Zook et al, 2020). The
   three tools were designed for linear genomes and performed well in a
   previous study ([71]Kosugi et al, 2019). We chose the best-performing
   tool to detect structural variants on the linear genomes. We then
   randomly selected five samples from the 185 samples and simulated five
   whole-genome sequencing samples that contain the structural variants
   in these five genomes. The simulated reads are paired-end, with a
   sequencing depth of 30× and a read length of 150 bp. We named this
   simulated dataset SimuA. We aligned the SimuA reads to the three
   reference genomes and then detected structural variants (see the
   Materials and Methods section). We calculated the mean precision,
   recall, and F1 score across the five samples ([72]Fig 1A; see the
   Materials and Methods section). The precisions for GRCh38, GCPan, and
   GGCPan are 95.30%, 96.34%, and 91.71%, respectively. They differ only
   modestly, with GGCPan slightly lower than the two linear genomes.
   The recalls for GRCh38, GCPan, and GGCPan are 71.28%, 61.02%, and
   82.70%, respectively. GGCPan therefore captures the most true SVs, with
   a recall about 11 to 22 percentage points higher than those of the two
   linear genomes. The recall of GCPan-based SV identification is about 10
   percentage points lower than that of GRCh38-based identification.
   Almost all (99%) of the SVs detected using GCPan are also detected
   using GRCh38, and 15% more SVs were detected using GRCh38 as the
   reference than using GCPan ([73]Fig S3A).
   This may be because some reads align to the non-reference sequences of
   GCPan, which lowers the number of SVs detected in the GRCh38 region.
   Two structural variants, an insertion and a deletion, illustrate this.
   For the insertion, which was detected with GRCh38 but not with GCPan,
   more reads were aligned to the locus using GRCh38, whereas the same
   reads were aligned to the non-reference sequences when GCPan was the
   reference ([74]Fig S3B). The situation for the deletion is similar
   ([75]Fig S3C). These two examples show that non-reference sequences
   are insertions relative to GRCh38, but traditional variant detection
   tools cannot report the non-reference sequences contained in each
   sample as variants. The evaluation results show that GGCPan balances
   accuracy and completeness when detecting structural variants from
   short reads, whereas linear genomes miss many true positives, a
   problem that is also prevalent with other tools ([76]Kosugi et al,
   2019). We also performed an evaluation using real GIAB data and
   reached a similar conclusion ([77]Supplemental Data 1) ([78]Hickey et
   al, 2020, [79]2024; [80]Zook et al, 2020).
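
   For reference, the three metrics follow the standard definitions and
   can be sketched as below. The function names are illustrative. The F1
   values computed from the reported mean precision and recall are only
   an approximation: the study averages per-sample F1 scores, and the
   harmonic mean of two means need not equal the mean of per-sample
   harmonic means.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard metrics from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = harmonic_f1(precision, recall)
    return precision, recall, f1


def harmonic_f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean precision and recall reported in the text for SimuA.
reported = {
    "GRCh38": (0.9530, 0.7128),
    "GCPan": (0.9634, 0.6102),
    "GGCPan": (0.9171, 0.8270),
}

# Approximate F1 from the reported means (see the caveat in the lead-in).
f1_of_means = {name: harmonic_f1(p, r) for name, (p, r) in reported.items()}
ranking = sorted(f1_of_means, key=f1_of_means.get, reverse=True)
```

   Because F1 penalizes the lower of the two inputs, the gain in recall
   with GGCPan outweighs its small loss in precision, which is the sense
   in which the graph reference balances accuracy and completeness.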

Figure S2. Performance comparison of structural variant detection using three
different reference genomes.

   (A) Comparison of the performance of different tools using GRCh38 as
   the reference for structural variation detection using simulated data
   with different sequencing depths. (B) Effect of the completeness of the
   graph-modeled pangenome on its performance in detecting structural
   variants. The x-axis represents the number of samples to construct the
   graph-modeled pangenome. The five samples used for evaluation were
   excluded from the samples used to construct the five graph pangenomes.
   (C) Flowchart of the evaluation of different reference genomes and
   variant identification tools using sequencing data from the GIAB HG002
   sample. Different colors represent different identification pipelines.
   (D) Performance evaluation results for variant identification using
   different reference genomes.

Figure 1. Performance of structural variant detection using three different
reference genomes.

   (A) Comparison of the performance of structural variant detection using
   three different reference genomes in simulated data. (B) Number of
   somatic structural variants detected using three reference genomes in
   real sequencing data from 185 patients. (C) Comparison of SVs detected
   using GRCh38 and GGCPan in 185 patients. (D) Comparison of SVs detected
   using GCPan and GGCPan in 185 patients. (C, D) “+” stands for presence
   and “−” for absence. (E) Enriched pathways for SV-related
   genes. The SVs are detected using GGCPan in 185 samples. The size of
   the dot represents the number of related genes included in the pathway.

Figure S3. Comparison of structural variants in simulated data (SimuA) using
GCPan and GRCh38 as the references.