Abstract

   Genomic structural variants (SVs) are a major source of genetic
   diversity in humans. Here, through long-read sequencing of 945 Han
   Chinese genomes, we identify 111,288 SVs, including 24.56% unreported
   variants, many with predicted functional importance. By integrating
   human population-level phenotypic and multi-omics data as well as two
   humanized mouse models, we demonstrate the causal roles of two SVs: one
   SV that emerges at the common ancestor of modern humans, Neanderthals,
   and Denisovans in GSDMD for bone mineral density and one
   modern-human-specific SV in WWP2 impacting height, weight, fat,
   craniofacial phenotypes and immunity. Our results suggest that the
   GSDMD SV could serve as a rapid and cost-effective biomarker for
   assessing the risk of cisplatin-induced acute kidney injury. The
   functional conservation from human to mouse and widespread signals of
   positive natural selection suggest that both SVs likely influence local
   adaptation, phenotypic diversity, and disease susceptibility across
   diverse human populations.

   Subject terms: Functional genomics, Structural variation, Evolutionary
   genetics
     __________________________________________________________________

   Genetic studies of Chinese individuals have been performed, but mostly
   with short read sequencing, limiting the types of variants that can be
   identified. Here, the authors perform long read sequencing of 945 han
   Chinese individuals, finding structural variants under natural
   selection and those associated with human traits and evolutionary
   history.

Introduction

   Structural variants (SVs) are genomic alterations ≥50 base pairs (bps)
   in length that result from duplications, deletions, insertions,
   inversions, and translocations^[78]1–[79]3. The medical relevance of SV
   in the human genome has long been recognized, dating back to the advent
   of karyotyping techniques which enabled the identification of large
   chromosomal aneuploidies (~3 Mb or more) and heteromorphisms leading to
   various genetic syndromes (reviewed in refs. ^[80]4–[81]6). The
   advances in cytogenetic (such as fluorescence in situ hybridization,
   FISH) and molecular techniques, such as bacterial artificial chromosome
   (BAC) array-comparative genomic hybridization (array-CGH),
   representational oligonucleotide microarray analysis (ROMA) and single
   nucleotide polymorphism (SNP) array, have enabled finer-scale
   genome-wide investigation of SV across populations^[82]7,[83]8. These
   approaches allow the discovery of intermediate-size structural
   variants, ranging from kilobases to megabases in size, that are common
   among phenotypically normal individuals^[84]7,[85]8. Their findings
   also suggest that SV may account for an equal or even greater amount of
   genetic diversity in the human genome compared to SNPs which were
   previously thought to be the primary source of human phenotypic
   diversity and disease susceptibility differences^[86]7,[87]8. While
   FISH, array-CGH, and ROMA have made significant contributions to SV
   detection, they have inherent limitations in resolution, breakpoint,
   and complex SV detection^[88]4,[89]5,[90]9–[91]11.

   Short-read sequencing (SRS) technologies and development of the SV
   detection algorithms have greatly improved the identification and
   delineation of SVs and their precise breakpoints^[92]2,[93]12–[94]17.
   The 1000 Genomes Project was a pioneer SRS-based study that
   investigated the SV diversity of 2504 individuals in global
   populations, albeit at low coverage (on average ~7X)^[95]2,[96]17.
   Subsequently, numerous studies have utilized SRS-based technologies to
   investigate SV diversity in global populations^[97]12–[98]16. These
   studies have significantly deepened our understanding of SV diversity
   and their functional impacts in modern humans and suggested that the
   germline SVs affect a significantly greater number of nucleotides than
   SNPs^[99]17. However, they likely missed a large number of SVs, owing
   to limitations of SV calling based on SRS data (e.g., short read
   length, and difficulty in sequencing highly repetitive regions and
   regions with extreme GC content^[100]18,[101]19).

   The introduction of long-read single molecular sequencing (LRS)
   technologies, including Pacific Biosciences (PacBio) and Oxford
   Nanopore Technologies (ONT), has enabled increasingly comprehensive
   detection of SVs^[102]18,[103]19. With an average read length >10 kb,
   and sometimes extending to several million bps, LRS can encompass
   entire regions affected by SVs^[104]18,[105]19. For example, a survey
   of 32 individuals from 25 diverse populations (ethnicities) using LRS
   identified 107,136 SVs, among which only 29.6% could be detected using
   the SRS data for the same samples^[106]3. Notably, an increasing number
   of studies have made great strides in investigating SVs using LRS
   technologies, which has significantly enhanced our understanding of SV
   diversity^[107]1,[108]3,[109]20–[110]27 and their potential
   associations with human phenotypic variations^[111]3,[112]22,[113]23,
   local adaptation^[114]24–[115]26 and disease
   susceptibility^[116]3,[117]21–[118]23. However, functional validation
   using model systems is still needed to confirm the reported
   genotype-phenotype association signals attributed to SVs. A lack of
   understanding of the causal involvement of SVs in human local
   adaptation, phenotypic diversity, and disease susceptibility
   differences will not only hinder our understanding of human
   evolutionary history and phenotypic diversity but also constrain the
   development of personalized medicine.

   In this study, we constructed a long-read-based SV catalog of 945 Han
   Chinese samples. We conducted extensive orthogonal validations,
   characterized the frequency distribution and location of SVs, and
   identified a significant number of previously unreported variants.
   Moreover, our analyses suggested a multifaceted origin for the SVs in
   our cohort and some of them can be traced back to chimpanzees. Based on
   the analyses of population-level multi-omics data from an independent
   Han Chinese cohort as well as two humanized mouse models, we identified
   two causal SVs for human phenotypic diversity and disease
   susceptibility differences. These models also allowed us to identify
   phenotypes that were previously reported in mouse gene knockout
   experiments that did not replicate in humans or humanized mice. We also
   studied the origin of the causal variants and revealed their
   differentiated selective pressures among ancestrally diverse
   populations based on comparison to the data from the chimpanzee
   reference genome, four archaic hominids, and 3,201 diverse modern
   humans from the 1000 Genomes Project.

Results

Generation of a long-read-based SV dataset of the Han population

   We conducted whole-genome sequencing on 945 samples of Han ancestry
   using ONT with the PromethION platform. After base calling and quality
   control of the raw data (Methods), we obtained an average N50 of 16 kb
   and 50 Gb of clean reads per individual, equating to a mean sequencing
   depth of 17X (Supplementary Fig. [119]1, Supplementary Data [120]1). We
   employed NGMLR^[121]28 to align the clean reads to the human reference
   genome GRCh38.p13 with parameters specifically tailored for ONT reads
   to ensure precision and minimize potential biases. Our SV detection
   strategy involved a joint calling approach across the population.
   Initially, SVs were identified in each sample utilizing cuteSV^[122]29.
   Subsequently, SVs whose breakpoints within 500 bps of each other across
   different individuals were merged using SURVIVOR^[123]30 to avoid
   double counting of SVs at similar positions. Using this consolidated SV
   set, we re-genotyped each sample employing LRcaller^[124]23 and
   remerged the SVs, thereby obtaining a comprehensive, population-level
   SV dataset (Methods).

   We detected 111,288 SVs, encompassing 42,300 insertions, 49,518
   deletions, 13,503 duplications, 5595 inversions, and 372 translocations
   (Fig. [125]1a). In agreement with prior studies^[126]1,[127]12,[128]23,
   we observed distinct peaks at sizes of approximately 300, 2500, and
   6000 bps (Fig. [129]1b). These peaks likely correspond to the
   retrotransposition activities of Alu, SINE-VNTR-Alu (SVA), and LINE
   elements, respectively^[130]2,[131]12,[132]23. On average, we
   identified 23,729 SVs per sample, with counts ranging from 22,275 to
   26,763, consistent with previous studies based on LRS
   technologies^[133]1,[134]3,[135]23,[136]31 (Fig. [137]1c).
   Additionally, SVs collectively impacted an average of 17.83 Mb of
   genomic sequences per individual, with insertions and deletions
   accounting for 82.68% of the total sequences.

Fig. 1. Summary of the Structural variants (SVs) across the Han population.

   [138]Fig. 1
   [139]Open in a new tab

   a A total of 111,288 SVs, encompassing 42,300 insertions, 49,518
   deletions, 13,503 duplications, 5595 inversions, and 372
   translocations, were identified in the present study. b Distribution of
   SV size. Three peaks at sizes of approximately 300, 2500, and 6000 bps
   correspond to the retro-transposition activities of Alu, SINE-VNTR-Alu
   (SVA), and long interspersed element (LINE), respectively. The X- and
   Y-axes indicate the size and number of SVs, respectively. c Number of
   SVs per sample. Each bar represents a sample. We used color to indicate
   different types of SV per sample. d Number of reported and unreported
   SV according to comparisons with different studies. The X- and Y-axes
   represent the percentage and the datasets used for comparison,
   respectively. e Number of singleton and non-singleton SVs across
   samples. The X- and Y-axes indicate the number of samples and the
   number of singleton and non-singleton SVs when adding new samples,
   respectively. The decrease in singletons reached a plateau at a sample
   size >700, indicating that the present study captured most of the SV
   diversity in the Han population. f A four-fold higher SV density was
   observed in subtelomeric versus other genomic regions (binned at
   500 kb). The X-axis represents the distance between SVs and telomeres,
   while the Y-axis indicates the number of SVs. Each point represents an
   SV. g The distribution of the number of SVs across different allele
   counts (indicates allele frequencies). The X- and Y-axes indicate the
   allele count and number of SVs, respectively. h The sizes of rare SVs
   (n = 71,178) with minor allele frequencies <0.05 are significantly
   greater than those of common SVs (minor allele frequencies ≥ 0.05,
   n = 39,738). The X-axis indicates rare and common SVs, and the Y-axis
   shows the SV size. The box plot shows interquartile range (IQR), with
   the middle line indicating the median, and whiskers representing
   1.5-fold IQR. P value is calculated by a two-sided Wilcoxon rank sum
   test, and the exact value is shown in the graph.

Confirmation that the SV dataset is high-quality

   We assessed the false discovery rate in our dataset using orthogonal
   methods (polymerase chain reaction (PCR) and Sanger sequencing) and
   comparisons to SVs with allele frequency (AF) > 0.5 in global
   populations. We first employed PCR and Sanger sequencing to validate a
   set of 100 common SVs (AF > 0.05), encompassing 45 insertions, 45
   deletions, four inversions, and six duplications, using the DNA of
   samples from an independent cohort^[140]32. We first manually examined
   the SVs and their breakpoints using Integrative Genomics
   Viewer^[141]33. Despite nine SVs (including six insertions and three
   deletions) being visible using IGV viewer, the primers of these SVs did
   not work as they either contained repetitive sequences or had
   breakpoints located in repetitive regions, leading to difficulties in
   PCR amplification. The remaining 91 SVs were amplified using PCR,
   producing bands of the expected size. We submitted these 91 PCR
   products for Sanger sequencing, of which 83 were successfully validated
   (Methods, Supplementary Figs. [142]2, [143]3, Supplementary
   Data [144]2). The remaining eight PCR products failed Sanger
   sequencing, likely due to the presence of highly repetitive regions,
   poly-structures, or extreme GC content, which can interfere with Sanger
   sequencing (Supplementary Materials). Additionally, we adopted a method
   previously developed in a study on SV diversity in Icelanders^[145]23,
   allowing us to estimate false-positive and false-negative rate
   boundaries of approximately 3.11–3.97% and 3.13–3.54%, respectively,
   based on comparisons with high-frequency SVs (AF > 0.5) in global
   populations^[146]1,[147]3 (Methods). Collectively, our analyses
   strongly indicate that a substantial proportion of the SVs within our
   dataset represent genuine genomic polymorphisms within the Han Chinese
   population.

Discovery of a large number of unreported SVs

   We uncovered a substantial number of previously unreported SVs within
   our dataset (Fig. [148]1d). Using a threshold of reciprocal overlap
   rate of ≥50%, we identified 87,308 (78.45%) SVs that were not reported
   in the gnomAD project^[149]12, a high-coverage (average depth of
   coverage of 32X) SRS dataset of 14,891 global samples (Fig. [150]1d).
   In addition, 83,845 (75.34%) and 79,153 (71.12%) of the SVs in our
   dataset were not reported in two LRS-based studies of SV diversity of
   global samples^[151]1,[152]3 (Fig. [153]1d). Furthermore, 64,936
   (58.35%) SVs in the present study were not found in a study of SV
   diversity based on 405 Chinese samples using LRS technology^[154]22
   (Fig. [155]1d). Finally, 41,870 (37.62%) SVs have not been reported in
   the dbVar database^[156]34, which catalogs the SVs from a collection of
   219 studies utilizing multiple platforms. Overall, 24.56%
   (27,333/111,288) of the SVs in the present study had not been
   documented in previous studies (Fig. [157]1d). Among the unreported
   SVs, we identified 780 SVs were located in the exonic regions of 670
   genes, among which 182 SVs were predicted to cause open reading frame
   shifts, as well as 2714 in transcription factor binding sites, 1836 in
   strong enhancers, 614 in insulators based on the annotations of various
   cell lines from the ENCODE project^[158]35 using ANNOVAR^[159]36
   (Methods). As expected, we found that the AFs of unreported SVs (median
   AF = 0.003) were significantly lower (P value < 2.2e-16, two-sided
   Wilcoxon rank sum test) than the AFs of the reported SVs (median
   AF = 0.02).

   We computed the count of unique SVs (exclusively found in a single
   sample) across various sample sizes to assess the representativeness of
   identified SVs in the Han population^[160]1 (Fig. [161]1e). Notably, as
   the sample size expanded, the number of singleton SVs decreased
   continuously until reaching a plateau after ~700 samples
   (Fig. [162]1e). This observation indicates that the sample size
   employed in our study effectively encompasses the majority of common
   SVs within the genomes of the Han population.

Identification of the genomic features of SVs

   We investigated the distribution of SVs across the genome and observed
   a four-fold higher SV density within the subtelomeric regions (within
   5 Mb from the telomere regions) than in other regions (Fig. [163]1f),
   potentially related to the increased double-strand
   breakage^[164]1,[165]37 and recombination^[166]1,[167]38,[168]39 rates
   and biased gene conversion^[169]40 in subtelomeric regions. In
   addition, 79.72% of the SV breakpoints overlapped with repeat elements,
   such as tandem repeats (17.84%) and interspersed repeats (57.51%),
   indicating the crucial role of repeat elements in SV formation^[170]41.

   The average size of the SVs was 1754 bps, ranging from 50 to
   99,743 bps. In addition, 79.22% (87,868/110,916) of the SVs were
   <1000 bps. When comparing the sizes of different types of SVs, we
   observed that insertions (average length = 521 bps) were shorter on
   average than deletions (average length = 2340 bps), inversions (average
   length = 3251 bps), and duplications (average length = 2848 bps).

   We noted a negative correlation between allele frequency and the number
   of SVs (Fig. [171]1g), aligning with the allele frequency distribution
   observed for single nucleotide variations in human
   genomes^[172]42,[173]43. The rare (minor allele frequency [MAF] <5%)
   and common (MAF ≥ 5%) SVs comprised 64.22% and 35.78% of all identified
   SVs in this study, respectively. The sizes of rare SVs were
   significantly greater (P < 2.2e-16, two-sided Wilcoxon rank sum test)
   than those of common SVs (Fig. [174]1h), which is in agreement with
   observations from previous studies^[175]2,[176]44,[177]45. Among the
   common SVs, 2917 SVs, including 1183 deletions, 237 duplications, 1472
   insertions, 22 inversions, and three translocations, were fixed
   (AF = 1) or nearly fixed (AF > 0.9) in the population. After excluding
   three translocations that cannot be genotyped, we genotyped the rest of
   the high-frequency SVs using the high-coverage (30X) SRS data of 2503
   samples across 26 populations in the 1000 Genomes Project^[178]46 using
   Paragraph^[179]47. We observed that 96.51% of the SVs have allele
   frequency >5% in more than half of the populations (13 out of 26) in
   the 1000 Genomes Project. This indicates the presence of minor alleles
   at these loci in the reference genome^[180]1(Methods).

Distribution of Han-derived SV diversity in modern and ancient humans

   To investigate the SV diversity in the Han in a global context, we
   genotyped 110,863 SVs (after excluding 372 translocations and 53 SVs
   containing non-A/G/T/C bases in their alternative sequences in the
   human reference genome) in 2503 unrelated high-coverage SRS samples in
   the 1000 Genomes Project, four archaic hominin genomes (three
   Neanderthals and one Denisovan) using Paragraph^[181]47, and in 38
   samples that were sequenced with LRS technology from multiple
   studies^[182]1,[183]3,[184]20 using LRcaller^[185]23, as well as
   conducted a comparison with the SVs obtained based on LRS data of two
   chimpanzee genomes^[186]48 and 405 Chinese genomes^[187]22 (Methods).

   We find that approximately 2% (2233 SVs affecting 828 genes, Methods)
   of SVs are ancient polymorphisms shared between humans and chimpanzees,
   indicating that they originated before the divergence of these species
   (Fig. [188]2a). An additional 5% (5124 affecting 1692 genes) are shared
   between modern humans, Neanderthals, and Denisovans, suggesting these
   SVs likely emerged in the common ancestor of modern humans,
   Neanderthals, and Denisovans (Fig. [189]2a). Around 32% (35,649
   affecting 6981 genes) of the identified SVs appear to be modern
   human-specific, as they are present across diverse populations around
   the world but absent from archaic hominin genomes. These variants may
   have played important functional roles in the more recent evolution of
   anatomical and physiological traits unique to Homo sapiens.

Fig. 2. The distribution of Han-derived structural variant (SV) diversity in
chimpanzees, ancient, and modern humans.

   [190]Fig. 2
   [191]Open in a new tab

   a The diversity of SV in a phylogenetic context. The percentages on
   each branch indicate the proportion of SVs in the Han lineage estimated
   to have arisen during the given evolutionary period. The phylogeny is
   based on the results of previous studies^[192]2,[193]121–[194]126. b
   Numbers of SVs shared among chimpanzees, archaic, and modern humans. We
   displayed the top 20 with the highest number of SVs across groups.

   Within modern human groups, we identified SVs that are either shared
   across populations from different continents or specific to certain
   continents (Fig. [195]2a). For example, 0.6% (640 affecting 386 genes)
   are shared between Europeans, Americans, and Asians, 0.4% (391
   affecting 211 genes) are shared between Americans and Asians, and ~11%
   (12,533 affecting 5291 genes) are specific to East Asians including the
   Han in the present study (Fig. [196]2a). Intriguingly, a small
   proportion (~0.1%, 59 affecting 33 genes) of SVs are shared between
   modern humans and Neanderthals or Denisovans, but absent in Africans
   (Fig. [197]2b). Finally, approximately 20% (22,000) of identified SVs
   appear specific to our Han cohort and likely represent de novo
   mutations (Fig. [198]2a), as the majority of these variants are rare,
   with a median allele frequency of ~0.003, within the cohort.

Characterization of a deletion shared by modern and archaic humans in GSDMD
associated with acute kidney injury, bone mineral density, and levels of
sphingomyelin and phosphatidylcholine

   To explore the potential functional impacts of the SVs, we performed a
   gene-based annotation based on NCBI RefSeq annotations (as of August
   17, 2020) using AnnotSV^[199]49–[200]51. A total of 51.24%
   (57,024/111,288) and 44.13% (49,112/111,288) of the SVs were located in
   intergenic and intronic regions, respectively, while 4.63%
   (5152/111,288) of the SVs overlapped with at least one exon of 3326
   genes. Among the exonic SVs, 28.92% (1490/5152) can potentially disrupt
   gene function through open reading frame shifts. Based on a gene-based
   ontology enrichment analysis using the Database for Annotation,
   Visualization and Integrated Discovery (DAVID, updated on October 11,
   2023)^[201]52,[202]53 (Methods), we found that the genes with SVs
   located in their exonic regions were significantly enriched in the
   pathways involved in regulation of keratinization (FDR-adjusted
   P = 0.008), transcription (FDR-adjusted P = 0.009), and metal ion
   binding (FDR-adjusted P = 0.06) (Supplementary Data [203]3).

   Regarding common SVs (MAF ≥ 5%), the SV location pattern was similar to
   that of all SVs, with 3.75% (1492/39,817) of SVs located in the exonic
   regions of 814 genes. Gene set enrichment analysis conducted using
   DAVID indicated significant involvement of the genes with common SVs
   located in their exonic regions in pathways related to immunity (immune
   response, FDR-adjusted P = 0.01; defense response to Gram-negative
   bacterium, FDR-adjusted P = 0.04), keratinization (FDR-adjusted
   P = 0.01), gas transport (carbon dioxide transport, FDR-adjusted
   P = 0.08; oxygen transport, FDR-adjusted P = 0.097), sensory perception
   of taste (FDR-adjusted P = 0.08), and digestion (FDR-adjusted
   P = 0.097) (Supplementary Data [204]4). Among the immune-related genes,
   we identified a 2175-bps deletion (chr8:143,551,891‒143,554,066,
   GRCh38) with an allele frequency equal to 43% that eliminated the first
   exon of the longest isoform of GSDMD ([205]NM_001166237.1). The deleted
   region is enriched with signals of histone modifications and
   transcription factor binding sites (Fig. [206]3). It was also predicted
   to function as an enhancer to different isoforms of GSDMD (Fig. [207]3,
   Supplementary Fig. [208]4) according to the dual elite annotation from
   the GeneHancer database^[209]54, which consolidates high-confidence
   enhancer-gene associations from multiple sources, including Hi-C,
   expression quantitative trait locus, and enhancer information. GSDMD
   encodes a pore-forming effector protein that facilitates inflammatory
   cell death, also known as pyroptosis^[210]55.

Fig. 3. Genomic context around the deletion (indicated by the red box) at the
GSDMD locus.

   [211]Fig. 3
   [212]Open in a new tab

   The genomic context includes the Hi-C interaction sourced from Rao et
   al.^[213]114, annotations for promoters and enhancers from the
   GeneHancer database, H3K4me1, and H3K27ac ChIP-seq tracks for primary B
   cells, primary T cells and osteoblasts from the Roadmap Epigenomics
   dataset, H3K4me1, H3K27ac or CTCF ChIP-seq tracks for HUVEC, HMEC,
   NHEK, HeLa-S3, a layered H3K27ac track on 7 cell lines (GM12878,
   H1-hESC, HSMM, HUVEC, K562, NHEK and NHLF), and transcription factor
   (TF) binding sites tracks for Liver, K562, HepG2 and GM12878 from
   ENCODE data set. All the annotations were visualized in the UCSC Genome
   Browser.

   Luciferase reporter assays were conducted in HeLa cells to validate the
   enhancer activity of the region affected by the deletion (Methods). Our
   experiments demonstrated a significant increase in luciferase signal
   (P < 0.0001, two-tailed t-test) when using plasmids containing the
   sequence of the deleted region compared to empty vectors after
   normalization. This finding indicates that the region affected by the
   deletion is likely to function as an enhancer (Fig. [214]4a, b). In
   agreement, the homozygous carriers of the deletion exhibited
   significantly reduced GSDMD expression (P < 0.01, one-way analysis of
   variance (ANOVA)) compared to heterozygous carriers, as determined by
   RT-qPCR (Fig. [215]4c). In summary, our findings suggest that the
   deletion depletes the first exon of the longest GSDMD isoform,
   resulting in the downregulation of GSDMD through the disruption of
   their enhancer activity.

Fig. 4. A deletion (DEL) at the GSDMD locus is associated with bone mineral
density (BMD) and acute kidney injury (AKI) risk in mice and humans.

   [216]Fig. 4
   [217]Open in a new tab

   a Schematic of the study of the functional impacts of the deletion at
   the GSDMD locus, created with BioRender.com. b Relative luciferase
   levels in HeLa cells transfected with empty vector (EV) or recombinant
   vector containing the DEL sequence (+DEL). The dots represent
   biological replicates (n = 6). c RT-qPCR assessment of the mRNA
   expression of GSDMD in three human samples. Each bar represents a
   sample, with 0/1 and 1/1 indicating heterozygous and homozygous
   deletion carriers, respectively. The dots represent technical
   replicates (n = 6). d RT-qPCR assessment of the mRNA expression of
   Gsdmd in enhancer knockout (KO) mice (n = 3–4) and wild-type
   littermates (n = 4) in spleen, lung, bone, and renal tube. e The
   micro-CT analysis of Gsdmd enhancer KO mice (n = 3) and wild-type
   littermates (n = 3). BMD and trabecular traits, including trabecular
   thickness (Tb. Th), trabecular number (Tb. N), trabecular separation
   (Tb. Sp), and bone volume/tissue volume (BV/TV), were measured. f
   Violin plot of BMD in human carriers (0/1: n = 313; 1/1: n = 114) and
   non- carriers (0/0: n = 181) of the deletion. Box plots show
   interquartile range (IQR), with the middle line indicating the median,
   and whiskers representing 1.5-fold IQR. g RT-qPCR assessment of the
   mRNA expression of four AKI markers, including Lcn2, Cd68, Il-6, and
   Tnfα in Gsdmd enhancer KO mice and wild-type littermates with or
   without cisplatin administration (n = 3–5). h Representative images
   (n = 5) of H&E- and TUNEL-stained kidney sections from Gsdmd enhancer
   KO mice and wild-type littermates following saline or cisplatin
   injection. H&E Hematoxylin and eosin, TUNEL terminal deoxynucleotidyl
   transferase dUTP nick-end labeling. Scale bar: 50 μm (H&E); 20 μm
   (TUNEL). All data are represented as the mean ± SEM. P values are
   calculated by one-way ANOVA for (c, g), by two-tailed t-test for (b, d,
   e), and by one-way Kruskal–Wallis test for (f), and the exact values
   are shown in the graph.

   To investigate the functional impacts of the deletion at the GSDMD
   locus in vivo, we employed CRISPR/Cas9 technology to generate a
   targeted deletion of a 1410-bps homologous region in the mouse genome
   (Fig. [218]4a; Methods). The knockout (KO) mice were born at the
   expected Mendelian ratio and grew normally. In line with findings
   observed in humans (Fig. [219]4c), RT-qPCR analyses revealed that the
   excision of the homologous region in mice resulted in significant
   reductions (P < 0.05, two-tailed t-test) in Gsdmd expression levels
   compared to those in their wild-type (WT) littermates across a spectrum
   of tissues, encompassing the spleen, lung, bone, and renal tube
   (Fig. [220]4d). These results indicate that the deleted region is
   likely to act as an enhancer regulating Gsdmd expression in the mouse
   genome.

   Notably, contrary to a decrease in bone mineral density (BMD) when
   knocking out Gsdmd gene in mice^[221]56, we observed the opposite
   effect in humanized mouse models and human carriers of the deletion.
   Humanized Gsdmd enhancer KO mice exhibited increases in BMD, trabecular
   thickness (Tb.Th), and trabecular number (Tb.N), along with reduced
   trabecular separation (Tb.Sp), relative to wild-type littermates
   (P < 0.05, two-tailed t-test, Fig. [222]4e). However, no significant
   difference occurred in the bone volume/tissue volume (BV/TV) of
   proximal tibial trabeculae between the Gsdmd enhancer KO mice and
   wild-type littermates (Fig. [223]4e). Additionally, humans with the
   Gsdmd structural variant showed significantly increased BMD compared to
   those without the variant (P < 0.0001, one-way Kruskal–Wallis test,
   Fig. [224]4f).

   Our results suggest the SV in GSDMD could potentially serve as a
   biomarker to evaluate risk for multiple diseases. For example, we
   observed that introducing the human mutation into mouse model confers a
   protective effect against cisplatin-induced acute kidney injury,
   evidencing for significant decreases in the expression levels of four
   key acute kidney injury markers, namely, Lcn2, Cd68, Il-6, and Tnfα, in
   Gsdmd enhancer KO mice versus wild-type littermates after cisplatin
   administration (Fig. [225]4g). This observation was further
   substantiated by histological analysis, which demonstrated reduced
   tubular dilation and immune cell infiltration in the enhancer KO mice
   (Fig. [226]4h) and decreased cell death compared to their wild-type
   littermates (Fig. [227]4h).

   We also observed the genotype-phenotype associations that have not been
   reported in previous Gsdmd KO mouse models. The carriers of the SV have
   significant increases of 13 phosphatidylcholines (P < 0.05, two-sided
   Wilcoxon rank sum test) and one sphingomyelin (P < 0.05, two-sided
   Wilcoxon rank sum test) than the non-carriers (Supplementary
   Fig. [228]5). Previous studies showed that sphingomyelin and
   phosphatidylcholine are not only crucial metabolites for maintaining
   cell normal function^[229]57–[230]60 but also risk markers of multiple
   diseases. For example, the dysregulated sphingomyelin is an independent
   cardiovascular disease risk factor of atherogenesis^[231]58. Altered
   phosphatidylcholines levels have also been linked to obesity,
   atherosclerosis, and insulin resistance via effects on tissue
   composition^[232]61.

   We next studied the evolution of the deletion at the GSDMD locus in
   humans based on the genotyping results of the high-coverage SRS data of
   3201 samples from 26 populations in the 1000 Genomes Project^[233]46
   (Fig. [234]5a and Supplementary Data [235]5) and samples from four
   archaic humans^[236]62–[237]65 (Fig. [238]5b) as well as the alignment
   of the human and chimpanzee reference genomes at this locus (Methods)
   and manual examination. Our analyses revealed that the deletion is
   absent in the chimpanzee reference genome (Supplementary Fig. [239]6a)
   and is present in genomes of three Neanderthals who are likely to be
   homozygous carriers and one Denisovan sample who is likely to be
   heterozygous carrier based on depth of coverage (Fig. [240]5b). In
   addition, it exhibits rarity (average AF = 0.03) within the African
   ancestry group but is common (average AF > 0.1) across all non-African
   populations of the 1000 Genomes Project (Fig. [241]5a) except for the
   Peruvians with AF = 0.02. Based on genotyping 102,118 SVs from a
   previous study of SV diversity in the global populations using LRS
   data^[242]3, we identified signatures of positive selection at this
   deletion in multiple non-African populations when calculating F[ST]
   (the fixation index) between non-African and African populations in the
   1000 Genomes project (Supplementary Fig. [243]7). The deletion ranks in
   the top 2% of SVs exhibiting the highest F[ST] values between the Han
   and Yoruba (Supplementary Fig. [244]7b; Methods), aligning with the
   substantial differential frequencies observed between these populations
   (Fig. [245]5a). Collectively, our results suggest that this deletion is
   likely to have evolved in the common ancestor of modern humans
   (Fig. [246]5b), Neanderthals, and Denisovans and then is likely to
   undergo positive selection in non-African populations.

Fig. 5. Evolution and demographic distribution of the deletion at the GSDMD
locus.

   [247]Fig. 5
   [248]Open in a new tab

   a The allele frequency of the deletion (indicated in green) in the
   genomes of 3201 samples from 26 populations in the 1000 Genomes
   Project^[249]46. In the pie charts, the frequencies of the reference
   and deletion alleles are depicted in orange and green, respectively. b
   The read mapping results around the deletion (indicated by the red box)
   at the GSDMD locus in the genomes of six representative modern humans
   and four archaic humans. The Han Chinese sample who carries a
   homozygous deletion is from the present study. We obtained the
   high-coverage SRS data of one Chinese-Dai individual in Xishuangbanna
   (CDX, ID: HG01046) who carries a homozygous deletion, one Gujarati
   Indians in Houston, TX (GIH, ID: NA20854) who carries a heterozygous
   deletion, one Toscani in Italy (TSI, ID: NA20581) sample who carries a
   homozygous deletion, one Colombian in Medellin, Colombia (CLM, ID:
   HG01261) who carries a heterozygous deletion, and one Luhya in Webuye,
   Kenya (LWK, ID: NA19323) who carries a heterozygous deletion. The
   region around the The phylogeny is based on the results of previous
   studies^[250]2,[251]121–[252]126. Due to its presence in the genomes of
   all modern human populations, three Neanderthals and one Denisovan, the
   deletion is likely to have emerged in the common ancestor of modern
   humans, Neanderthals, and Denisovans (highlighted by the red arrow).
   IBS Iberian populations in Spain, GBR British in England and Scotland,
   FIN Finnish in Finland, PJL Punjabi in Lahore, Pakistan, ITU Indian
   Telugu in the UK, STU Sri Lankan Tamil in the UK, BEB Bengali in
   Bangladesh, CHS Han Chinese South, CHB Han Chinese in Beijing, China,
   JPT Japanese in Tokyo, Japan, KHV Kinh in Ho Chi Minh City, Vietnam,
   CEU Utah residents (CEPH) with Northern and Western European, MXL
   Mexican Ancestry in Los Angeles, California, PUR Puerto Rican in Puerto
   Rico, PEL Peruvian in Lima, Peru, ACB African Caribbean in Barbados,
   ASW African Ancestry in Southwest US, ESN Esan in Nigeria, GWD Gambian
   in Western Division, The Gambia - Mandinka, MSL Mende in Sierra Leone,
   YRI Yoruba in Ibadan, Nigeria.

An unreported complex SV specific to modern humans in WWP2 is associated with
shorter stature, increased body fat percentage, and enhanced immune response

   We next investigated the regulatory potential of SVs, considering that
   95.37% were located in intronic and intergenic regions, known to be
   enriched with regulatory elements. Our analysis revealed that 3.53% of
   the SVs entirely overlapped with 5666 enhancers, as annotated in the
   GeneHancer database^[253]54. Regarding the common SVs (MAF ≥ 5%), 116
   completely covered 158 enhancers regulating the expression of 277
   protein-coding genes according to GeneHancer database annotations. A
   gene-based enrichment test using DAVID suggested that these genes are
   significantly involved in pathways related to the composition of the
   cytoplasm (cytoplasm, FDR-adjusted P = 0.08, Supplementary
   Data [254]6).

   We further explored the functional implications of the genes
   potentially regulated by the common SVs, harnessing the comprehensive
   resources of the Mouse Genome Informatics (MGI) website^[255]66. MGI
   catalogs comprehensive functional, phenotypic, and disease annotations
   of mouse genes, primarily derived from gene knockout experiments. Our
   investigation identified 121 genes within the mouse genome that are
   orthologous to the 277 genes regulated by SVs observed in humans
   according to human-mouse orthologous gene annotations in the MGD
   database^[256]66 (Methods). After excluding a union of 24 genes linked
   to lethality (23 genes) and had no discernible phenotypic consequences
   (two genes), as determined from knockout experiments in mice (Methods),
   the remaining 97 genes were demonstrated to likely exert influence
   across 833 distinct ontologies in mice. Notably, several top-ranking
   ontologies pertained to phenotypic attributes (such as decreased body
   weight, size, litter size, and postnatal growth retardation),
   physiological traits (such as impaired glucose tolerance and decreased
   susceptibility to diet-induced obesity), fertility (such as
   oligozoospermia, asthenozoospermia, and male infertility), and immune
   system physiology (Supplementary Data [257]7). For example, we
   identified a complex common SV (AF = 0.28)—a 229-bps insertion (caused
   by the fusion of two SINE elements ~4225 bps apart from each other,
   Figs. [258]6 and [259]7) followed by a 354-bps deletion—located within
   the fourth intron of WWP2 ([260]NM_001270454.2) that is associated with
   multiple phenotypic (e.g., body weight and size and craniofacial
   traits) and immunological traits based on KO experiments in mice.
   However, the phenotypes reported in different studies are inconsistent.
   For example, one study observed reduced body length and weight along
   with abnormal teeth and craniofacial traits^[261]67. However, these
   phenotypes were not replicable in other studies^[262]68,[263]69. These
   discrepancies impede the translation of phenotypes from mouse knockout
   experiments to humans and emphasize the need to validate model organism
   phenotypes using human genetic data. Additionally, while the
   dysregulation of WWP2 has been implicated in the increased risks of
   cardiovascular diseases^[264]70,[265]71, osteoarthritis^[266]72, and an
   emerging oncogene (reviewed in ref. ^[267]73), no causal variant has
   yet been identified that regulates WWP2 and its associated phenotypes.

Fig. 6. Genomic context around the 1127-bps enhancer (indicated by the red
box) at the WWP2 locus.

   [268]Fig. 6
   [269]Open in a new tab

   The genomic context includes Hi-C interaction sourced from Rao et
   al.^[270]114, annotations for promoters and enhancers from the
   GeneHancer database, H3K4me1 and H3K27ac ChIP-seq tracks for Bone
   derived MSCs (E026), Mesenchymal stem cell derived chondrocyte (E049)
   and osteoblasts (E129) from the Roadmap Epigenomics dataset, H3K4me1,
   H3K27ac ChIP-seq tracks for HMEC, HepG2, Osteoblasts, a layered H3K27ac
   track on 7 cell lines (GM12878, H1-hESC, HSMM, HUVEC, K562, NHEK and
   NHLF), and transcription factor (TF) binding sites tracks for HepG2
   from ENCODE data set. All the annotations were visualized in the UCSC
   Genome Browser.

Fig. 7. A complex SV at the WWP2 locus is associated with height, body fat
percentage, craniofacial traits, and immune response.

   [271]Fig. 7
   [272]Open in a new tab

   a Schematic of the study of the functional impacts of the SV at the
   WWP2 locus. b The SV consists of an insertion mediated by the fusion of
   two SINE elements followed by a 354-bps deletion. c Dot plot of the
   sequences of the carriers of the SV (X-axis) and the human reference
   genome (Y-axis). d The complex SV significantly reduces the enhancer
   activity according to luciferase reporter assays in HeLa. The dots
   represent biological replicates (n = 6). e RT-qPCR assessment of the
   mRNA expression of WWP2 in three human samples. Each bar represents a
   sample, with 0/0 and 0/1 indicating noncarrier and heterozygous SV
   carriers, respectively. The dots represent technical replicates
   (n = 6). RT-qPCR assessment of the relative transcript levels of three
   Wwp2 isoforms (Wwp2-FL in (f), Wwp2-N in (g), and Wwp2-C in (h)) in
   multiple tissues in Wwp2 enhancer knockout (KO) mice and wild-type
   littermates. Body length, weight (i) and fat percentage (BF%) (j) in
   Wwp2 enhancer KO mice and wild-type littermates. k BF% in human SV
   carriers and non-carriers. l Visceral fat area in human SV carriers and
   non-carriers. m Craniofacial traits between Wwp2 enhancer KO mice and
   wild-type littermates. n Head lengths, a trait homologous to skull
   length (distance between A and B in m), in human SV carriers and
   noncarriers. o IL-6 levels after flagellin stimulation in human SV
   carriers and noncarriers. The mice were measured at five weeks of age.
   All data are represented as the mean ± SEM. Box plots in (k, l, n, o)
   show interquartile range (IQR), with the middle line indicating the
   median, and whiskers representing 1.5-fold IQR. P values were
   calculated by one-way ANOVA for (d, e), two-way ANOVA for (f–h),
   one-way Kruskal-Wallis test for (k, l, n, o), and by two-tailed t-test
   for (i, j, m). The number of the samples in box plots and the exact P
   values are shown in the graph. a, n were created with BioRender.com.

   We first validated this SV using PCR-based Sanger sequencing and local
   assembly of the long reads at this region (Fig. [273]7a–c,
   Supplementary Fig. [274]8; Methods). Our analyses further suggest that
   the SINE-mediated rearrangement led to a false-positive detection of
   a ~4.5 kb deletion signal in studies conducted with SRS and LRS
   technologies when the reads were mapped to this region (Supplementary
   Fig. [275]8; Methods).

   The SV in WWP2 overlapped a 1127-bps annotated enhancer region
   (chr16:69,820,390‒69,821,516, GRCh38) that is likely to be active in
   multiple bone-related cells/tissues as well as breast, brain, duodenum
   and stomach tissues according to the annotations of the Roadmap
   Epigenomics Project^[276]74 and regulates the expressions of different
   isoforms of WWP2 based on Hi-C data (Fig. [277]6 and Supplementary
   Figs. [278]9, [279]10). The enhancer potential of the region was first
   validated using luciferase reporter assays. The plasmid containing the
   1127-bps sequence demonstrated significantly greater luciferase signal
   levels (P < 0.0001, one-way ANOVA) than the empty vector after
   normalization (Fig. [280]7d). In addition, the presence of the SV,
   especially when both the insertion and the deletion were concurrently
   presented, significantly reduced luciferase signals (P < 0.0001,
   one-way ANOVA), indicating that the complex SV disrupted the enhancer
   activity (Fig. [281]7d). Furthermore, RT-qPCR revealed that carriers of
   the SV exhibited a significant decrease in WWP2 expression in the whole
   blood (P < 0.05, one-way ANOVA) versus noncarriers (Fig. [282]7e).

   The SV was associated with standing height in our cohort, individuals
   carrying this SV displayed a lower standing height (median
   height = 164 cm) than that of the noncarriers (median height = 165 cm),
   consistent with the reduced body length caused by knocking out of the
   Wwp2 gene in mice^[283]67,[284]75. While the difference was not
   statistically significant due to the relatively small sample size
   (n = 1016) in the present study (P = 0.28, one-way ANOVA), we found
   that the SV is in complete linkage disequilibrium with rs8052428-T,
   which is ~60 kb downstream to the SV, based on manual examination of
   the Pacbio HiFi data from a Chinese quartet family
   ([285]http://chinese-quartet.org/)^[286]76. rs8052428-T is
   significantly associated with shorter sitting height (Beta = −0.101,
   P = 1.1E-29) and standing height (Beta = −0.132, P = 1.0E-17) in a
   prior genome-wide association study (GWAS) of height variation of
   361,194 UK Biobank samples ([287]http://www.nealelab.is/uk-biobank/).
   However, no causal variant has yet been identified at this locus. We
   observed that rs8052428 is located at the intronic region of WWP2 and
   there is no enhancer annotation around the region (Supplementary
   Fig. [288]11). This supports the SV, rather than rs8052428, is the
   causal variant of height variation not only in the Han but also in
   samples of European ancestry.

   To further understand the functional impacts of the SV, we depleted a
   1267-bps region, that is homologous to the SV-affected region based on
   the liftOver tool in the UCSC genome browser website^[289]77, in the
   mouse genome using CRISPR/Cas9 technology (Methods). The KO mice were
   born at the expected Mendelian ratio and grew normally. We observed
   significant reductions in the expression of full-length Wwp2 isoform,
   as well as Wwp2-N and Wwp2-C isoforms, in spleen, kidney, and muscle
   tissues of homozygous enhancer knockout mice compared to their
   wild-type littermates. Most of these reductions were statistically
   significant (Fig. [290]7f–h, P < 0.05, two-way ANOVA), confirming the
   homologous region is functional as an enhancer, in the mouse genome.

   Furthermore, the Wwp2 enhancer KO mice in the present study displayed
   significantly shorter body length and lower body weight (P < 0.01,
   two-tailed t-test) than their wild-type littermates (Fig. [291]7i).
   Remarkably, the mice with the enhancer knocked out exhibited
   substantial rises in body fat percentages (BF%) (Fig. [292]7j,
   P < 0.05, two-tailed t-test) than the wild-type mice, which has not
   been observed in previous Wwp2 KO mouse models. This result is further
   exemplified by observations in humans showing that carriers of the SV
   have significantly greater body fat percentage (Fig. [293]7k, P < 0.05,
   one-way Kruskal-Wallis test) and larger visceral fat area
   (Fig. [294]7l, P < 0.05, one-way Kruskal-Wallis test) than the
   non-carriers.

   Deep phenotyping data in humans and introducing human mutations into
   mouse genomes have enabled the identification of the irreproducible and
   unreported phenotypes in previous mouse KO experiments. For instance,
   the abnormal craniofacial and dental changes reported in Wwp2 knockout
   mice^[295]67 could not be replicated in our humanized mouse model and
   human carriers. Instead, we observed that the Wwp2 enhancer KO mice
   exhibited a shorter skull length (Fig. [296]7m, distance between A to
   B, P < 0.01, two-tailed t-test), a more domed skull, and a shortened
   snout (Fig. [297]7m, distance between A to C, P < 0.05,
   two-tailedt-test) in comparison with the wild-type littermates using
   micro-CT. This was further exemplified by our human samples, as we only
   observed the head length (the distance between the glabella and
   opisthocranion), a homologous trait to skull length in mice, was
   significantly shorter in the SV carriers (P < 0.05, one-way
   Kruskal-Wallis test) than that in the noncarriers (Fig. [298]7n) but no
   other abnormal craniofacial or dental phenotypes were observed in human
   carriers.

   Our results suggest that the SV in WWP2 may yield an increased innate
   immune response to its carriers. While there was no significant
   difference in the IL-6 levels in the whole blood among the samples in
   the independent cohort (Supplementary Fig. [299]12a), we observed
   significant increases in IL-6 in response to flagellin stimulation, a
   subunit protein that polymerizes to form the filaments of bacterial
   flagella and a Toll-like receptor 5 agonist, in the carriers of the SV
   compared to that in the noncarriers (P < 0.05, one-way Kruskal−Wallis
   test) (Fig. [300]7o), consistent with the observation in Wwp2 KO
   mice^[301]78. In addition, we observed a similar trend of IL-6 level
   increase after using another three stimulants, including
   Pam3-Cys-Ser-Lys4 (Pam3CSK4, a synthetic triacylated lipopeptide and
   Toll-like receptor 1/2 agonist, Supplementary Fig. [302]12b),
   Lipopolysaccharide (LPS, an outer membrane component of gram-negative
   bacteria and a Toll-like receptor 4 agonist, Supplementary
   Fig. [303]12c), and Resiquimod (R848, a tricyclic organic molecule and
   an agonist of TLR7/TLR8, Supplementary Fig. [304]12d). However, the
   increases after stimulations did not reach statistical significance
   (Supplementary Fig. [305]12b–d).

   We studied the evolution of the SV at the WWP2 locus based on the
   genotyping results of the high-coverage SRS data of 3201 samples from
   26 populations in the 1000 Genomes Project^[306]46 (Fig. [307]8a and
   Supplementary Data [308]5) and samples from four archaic humans
   (Fig. [309]8b) as well as the alignment of the human and chimpanzee
   reference genomes at this locus (Methods). We observed that the SV was
   not present in the genomes of chimpanzees (Supplementary Fig. [310]6b)
   and the four archaic humans (Fig. [311]8b). In addition, it is rare in
   African populations (average AF = 0.03) but common (average AF > 0.2)
   in all non-African populations of the 1000 Genomes Project
   (Fig. [312]8a). Similar to the observation in the GSDMD, we identified
   the signature of positive selection at this SV in multiple non-African
   populations including the Han based on the genome-wide distribution of
   F[ST] (Supplementary Fig. [313]7; Methods). Collectively, our results
   suggest that this SV likely emerged at the common ancestor of modern
   humans and then is likely to undergo positive selection after humans
   migrated out of Africa (Fig. [314]8b).

Fig. 8. Evolution and demographic distribution of the SV at the WWP2 locus.

   [315]Fig. 8
   [316]Open in a new tab

   a The allele frequency of the SV (indicated in green) in the genomes of
   3201 samples from 26 populations in the 1000 Genomes Project^[317]46.
   In the pie charts, the frequencies of the reference and deletion
   alleles are depicted in yellow and green, respectively. b The
   read-mapping results around the complex SV (indicated by the red box)
   at the WWP2 locus in the genomes of six representative modern humans
   and four archaic humans. The Han Chinese sample who carries a
   homozygous SV is from the present study. We obtained the high-coverage
   SRS data of one Chinese-Dai individual in Xishuangbanna (CDX, ID:
   HG01798) who carries a homozygous SV, one Gujarati Indians in Houston,
   TX (GIH, ID: NA20854) who carries a heterozygous SV, one Toscani in
   Italy (TSI, ID: NA20585) who carries a homozygous SV, one Puerto Rican
   in Puerto Rico (PUR, ID: HG01047) who carries a heterozygous SV, and
   one Luhya in Webuye, Kenya (LWK, ID: NA19318) who carries a homozygous
   SV. The phylogeny was constructed based on previous
   studies^[318]2,[319]121–[320]126. Due to its presence in the genomes of
   all modern human populations and its absence from the four archaic
   human genomes, this SV is likely to have emerged in the common ancestor
   of modern humans (highlighted by the red arrow). IBS: Iberian
   populations in Spain; GBR: British in England and Scotland; FIN:
   Finnish in Finland; PJL: Punjabi in Lahore, Pakistan; ITU: Indian
   Telugu in the UK; STU: Sri Lankan Tamil in the UK; BEB: Bengali in
   Bangladesh; CHS: Han Chinese South; CHB: Han Chinese in Beijing, China;
   JPT: Japanese in Tokyo, Japan; KHV: Kinh in Ho Chi Minh City, Vietnam;
   CEU: Utah residents (CEPH) with Northern and Western European; MXL:
   Mexican Ancestry in Los Angeles, California; CLM: Colombian in
   Medellin, Colombia; PEL: Peruvian in Lima, Peru; ACB: African Caribbean
   in Barbados; ASW: African Ancestry in Southwest US; ESN: Esan in
   Nigeria; GWD: Gambian in Western Division, The Gambia - Mandinka; MSL:
   Mende in Sierra Leone; YRI: Yoruba in Ibadan, Nigeria.

Discussion

   In this study, we investigated the genome-wide SV diversity of 945
   samples of Han ancestry using LRS. Our large-scale LRS-based SV
   detection approach revealed tens of thousands of unreported SVs, many
   of which were predicted to be functional and of potential biomedical
   relevance. Moreover, we likely cataloged a majority of common SVs in
   the Han population. The SV catalog from this cohort can thus help
   prioritize variants identified in clinical sequencing and serve as a
   reference panel for incorporating SVs into future GWAS^[321]12.
   Furthermore, by integrating these Han Chinese SVs with additional
   diverse populations, we can gain new insights into human evolution and
   local adaptation from an understudied SV perspective, complementing
   existing knowledge primarily based on SNPs. Therefore, the
   comprehensive SV dataset compiled here represents a valuable genomic
   resource with diverse applications across life sciences and medicine
   research.

   Our study provides insights into the complex mosaic structure of
   genomic variation in the Han (Fig. [322]2). Tracing the ancestral
   origins of SVs reveals contributions spanning highly divergent
   evolutionary timescales, from millions of years in our hominin ancestry
   to ongoing mutation processes. Analyses of the functional effects and
   phenotypic consequences of variants emerging at different periods will
   shed further light on the genetic changes involved in the origin and
   diversification of anatomically modern human populations^[323]79.

   Aligning with the findings in previous studies^[324]3,[325]22–[326]26,
   our analyses suggest that a substantial portion of phenotypic diversity
   and differences in disease susceptibility among humans can be
   attributed to SV-mediated genomic alterations. In addition, we
   identified the causal SVs that influence WWP2 and GSDMD and their
   associated phenotypes in humans. Pinpointing causal variants for
   complex traits has proven remarkably challenging, mainly because the
   causal variants normally reside in many closely correlated variants due
   to linkage disequilibrium. Yet doing so provides pivotal insights into
   key questions surrounding modern human evolution, such as the origin
   and evolutionary trajectory of a phenotype and the genetic basis of
   disease susceptibility disparity in modern humans. Despite their
   prevalence and functional importance, SVs have not been widely used in
   current GWAS, which remains a primary method for studying complex
   traits/diseases but relies primarily on SNPs^[327]12. This has further
   increased the difficulty of identifying causal variants underlying
   complex traits or diseases. For example, while our results indicate
   that the SV in WWP2 is a previously unidentified locus associated with
   craniofacial polymorphism in humans (Fig. [328]7), searches of the EBI
   GWAS Catalog through November 2023 have not implicated any variants in
   WWP2 in previous genome-wide association studies of human craniofacial
   variation. Therefore, there is a strong need to integrate SVs into
   future GWAS when studying complex traits in humans^[329]12.

   We also identified the phenotypes reported in previous KO mouse models
   can be misleading or not replicable in humans and humanized mice. For
   example, prior mouse models exhibited decreased BMD with Gsdmd
   deficiency^[330]56. However, we revealed the opposite effect in human
   and mouse carriers of the SV in Gsdmd, who displayed significantly
   increased BMD compared to non-carriers. This disconnect was further
   highlighted with Wwp2. We showed that both human and mouse carriers of
   the Wwp2 structural variant have reduced height and weight, yet the
   abnormal craniofacial and dental changes reported in Wwp2 KO
   mice^[331]67 were not observed in human carriers or our humanized mouse
   model. One possibility to explain this discrepancy could be caused by a
   compensatory effect in which knocking out Gsdmd or Wwp2 may activate
   the expression of other members of the gene family. This compensation
   may not occur or may be less pronounced when knocking out the enhancer
   of the gene, where some level of expression remains. However, these
   results could also be explained by knockout experiments, which often
   remove a large genomic segment and can generate confounding phenotypes
   by simultaneously disrupting other genes directly or the regulatory
   elements of other genes^[332]69. These discrepancies emphasize the need
   to validate model organism phenotypes in human genetic data.

   Our findings suggest a dynamic interplay between genetic variants and
   environmental pressures that, wherein genetic variants that were
   adaptive in the past can become maladaptive in modern
   environments^[333]80. For example, while emerging at different times,
   the causal SVs in the WWP2 and GSDMD genes are both rare in Africans
   while showing signs of positive natural selection in numerous
   non-African populations (Figs. [334]5 and [335]8). The WWP2 variant
   yields increased inflammatory responses and adiposity for its carriers,
   particularly the accumulation of visceral fat. This may have provided
   an evolutionary advantage during human migration and adaptation to new
   environments that often lacked reliable food sources or posed
   unfamiliar pathogenic threats. However, increased visceral fat is
   associated with an increased risk for multiple diseases in modern
   societies, such as high blood pressure, obesity, elevated cholesterol,
   and insulin resistance^[336]81. The SV in the GSDMD locus is associated
   with increased BMD. Denser, stronger bones would have reduced
   susceptibility to fractures and improved load-bearing capacity, which
   could be beneficial in physically demanding environments or situations
   where physical prowess is necessary for survival, such as hunting,
   combat, or escaping predators. However, osteosclerotic bones also tend
   to be misshapen and abnormally enlarged. This progressive skeletal
   deformity can cause reduced mobility, balance impairments, and
   difficulty with fine motor tasks^[337]82. Overall, by unraveling the
   phenotypic effects and evolutionary origins of variants like the SVs in
   WWP2 and GSDMD, we gain valuable insights into genomic diversity and
   local adaptation during human migration across the globe.
   Reconstructing this complex history is also key to understanding
   geographic differences in disease and designing personalized medical
   strategies tailored to patients’ ancestry.

   Our findings provide a rapid and cost-effective predictive biomarker
   that could enable personalized risk stratification when treating cancer
   before cisplatin administration and other GSDMD-mediated pyroptosis in
   organ injuries. Using a PCR-based screening for the SV affecting a
   conserved enhancer regulating the expression of GSDMD in patients,
   clinicians may be able to adjust cisplatin dosing or provide
   alternative regimens for susceptible patients and thus reduce the risks
   of acute kidney injury. As cisplatin is a widely used and effective
   chemotherapeutic against multiple solid tumor types, mitigating its
   toxicity to the kidney and other organs could significantly improve
   clinical outcomes and quality of life for many patients. In addition,
   the dysregulation of GSDMD has also been implicated in multiple organ
   injuries^[338]83, such as liver^[339]84,[340]85, brain^[341]86,[342]87,
   spinal cord^[343]88–[344]91, heart, and vessel^[345]83,[346]92–[347]94,
   the SV can serve as a promising intervention target by selectively
   controlling GSDMD expression in relevant cell types and thus allowing
   interventions before downstream injury processes
   propagate^[348]95–[349]97.

   The present study has some limitations. First, the functional
   divergence between human and mouse orthologous genes may have misled us
   to interpret the functional impacts of some SVs. Experiments utilizing
   conditional knockout or knocking in the human variants in mouse
   models^[350]98 or knockdown of genes in human organoid models using
   siRNA or shRNA, or gene editing of human embryonic stem cells or
   induced pluripotent stem cells using CRISPR/Cas9 can overcome some
   limitations of conventional knockouts and further elucidate gene
   functions in a human context. Additionally, our estimates of SV
   emergence and evolution are constrained by factors including the
   availability of only four sequenced archaic human genomes and genetic
   polymorphisms not represented within the 1000 Genomes Project samples.
   Moreover, since we primarily focused on studying the common SVs, we may
   have missed some rare, but functionally important SVs. As we enter the
   ‘phenomics’ era^[351]99,[352]100, characterized by the acquisition of
   high-dimensional phenotypic data in diverse global populations using
   various ‘-omics’ datasets and the ongoing advancements in both in vitro
   and in vivo technologies as well as genomic data from underrepresented
   populations, we anticipate that the functions of SVs will be elucidated
   in a broader range of human populations in the future^[353]80.

Methods

Inclusion and ethics

   The study complied with all relevant regulations for working with human
   subjects in China. The Ethics Committee of the School of Life Sciences,
   Fudan University, Shanghai, China approved the study. Participants were
   recruited to a project studying physical anthropology diversity in
   China funded by the Ministry of Science and Technology of the People’s
   Republic of China (2015FY111700). Informed consents were approved by
   all participants. Our study is compliant with the Guidance of the
   Ministry of Science and Technology (MOST) of China for the Review and
   Approval of Human Genetic Resources.

Sequencing library preparation

   We extracted DNA from whole blood using a TGuild Blood sample genomic
   DNA kit from TIANGEN®. The ONT libraries were prepared as follows:
   2‒5 µg genomic DNA was sheared to ~20‒30 kb fragments in a g-TUBE
   (#520079, Covaris) by spinning twice for 2 min at 7000 rpm in an
   Eppendorf MiniSpin centrifuge. The samples were mixed with 6.5 μl of
   NEBNext FFPE DNA Repair Buffer, 2 μl of NEBNext FFPE DNA Repair Mix
   (New England Biolab, M6630), and 3.5 μl of nuclease-free water (NFW)
   and incubated at 20 °C for 15 min for DNA repair. We re-pooled and
   cleaned up the samples using a 0.8× volume of AMPure XP beads (Beckman
   Coulter) according to the manufacturer’s instructions. The
   purified-ligated DNA was resuspended in 15.5 μl ELB (SQK-LSK108). A
   1-μl aliquot was quantified by fluorometry (Qubit) to ensure ≥500 ng
   DNA was retained. The final library was prepared by mixing 35.0 μl RBF
   (SQK-LSK108), 25.5 μl LBB (SQK-LSK108), and 14.5 μl purified-ligated
   DNA.

ONT sequencing and base calling

   We conducted long-read sequencing of 945 Han Chinese individuals using
   an ONT PromethION sequencer with a 1D flow cell and protein pore R9.4
   1D chemistry according to the manufacturer’s instructions. Base calling
   was performed using Guppy version 3.2.0^[354]101 with the “flipflop”
   algorithm on the PromethION compute device.

Whole genome sequencing, phenotypic and immunogenic and metabolic profiling
of an independent cohort

   We conducted whole genome sequencing of 1016 samples using the
   BGI-DNBSEQ-T7 platform. Briefly, the DNA samples were extracted from
   whole blood. Paired-end libraries (with insert size around 280 bps)
   were prepared based on the manufacturer’s instructions. For each
   sample, we sequenced 100 bps at each end and obtained >90 Gb sequencing
   reads, corresponding to 30X coverage. We removed the reads <25 bps, or
   ≥10% unidentified nucleotides (N), or 50% bases having Phred quality
   <5, or >10 nt aligned to the adapter/primer/barcode/index, allowing
   ≤10% mismatches, or show significant biased sequence composition
   difference such as the reads between the A/T and G/C ratio greater than
   20%, or potential PCR duplicates.

   The sequencing reads were mapped to the human reference genome
   GRCh38.p13 using BWA-MEM mode (version 0.7.17)^[355]102 with default
   parameters. The mapping results were processed using SAMtools version
   1.15.1^[356]103 to sort and index. Finally, we used GATK version
   4.1.7.0^[357]104 to mark duplicated reads. We genotyped the LRS-based
   SVs in this cohort using Paragraph version 2.3^[358]47 with default
   parameters.

   We also measured the standing height of the samples in this cohort.
   Whole-body BMD of 630 volunteers was assessed by dual-energy X-ray
   absorptiometry (DXA). The BMD (g/cm^2) measurement was performed by the
   Lunar iDXA (GE Healthcare, General Electric Co., Boston, MA, USA) using
   standard testing procedures.

   Inflammatory factor levels were measured as described in a previous
   study^[359]105. Specifically, venous blood was collected from healthy
   individuals into sodium heparin tubes for whole blood stimulation
   assays. Blood was diluted 5-fold in a 48-well plate pre-filled with
   RPMI media (Wisent #350-000-CL) (10% FBS (Gibco #10099141), 1%
   penicillin/streptomycin (Thermo #15140122)) and four stimulants:
   Pam3CSK4 (Invivogen #tlrl-pms, 100 ng/mL), flagellin (Invivogen
   #tlrl-stfla, 100 ng/mL), LPS (Sigma #L2880, 10 ng/mL), R848 (Invivogen
   #tlrl-r848, 500 ng/mL), or no stimulant (negative control). Plates were
   incubated at 37 °C with 5% CO[2] for 24 h. Supernatants were collected
   by centrifuging at 500 g for 10 min and immediately frozen at −80 °C
   until analysis. The levels of IL-6 were quantified using ELISA
   (Biolegend #555220) per manufacturer protocol.

   We purchased PG (15:0/15:0) (#322647-32-5), PI (17:0/14:1)
   (#1246304-61-9) from Sigma-Aldrich (St. Louis, MO, USA) and the
   LipidyzerTM kit from SCIEX (Chromos, Singapore). Sinopharm Chemical
   Reagent Co., Ltd. (Shanghai, China) provided the HPLC grade Chloroform.
   Merck (Darmstadt, Germany) supplied Methanol (MeOH) (#67-56-1),
   isopropanol (IPA) (#67-63-0), methyl tert-butyl ether (MTBE)
   (#1634-04-4) and dichloromethane (DCM) (#75-09-2). Sigma-Aldrich (St.
   Louis, MO, USA) obtained acetonitrile (ACN) (#75-05-8), formic acid
   (FA) (#64-18-6), ammonium hydroxide (NH4OH) (#1336-21-6) and ammonium
   acetate (NH4OAc) (#631-61-8). We purified ultra-pure water using the
   Milli-Q water purification system (Millipore, MA, USA).

   We extracted 8 µl plasma samples and 8 µl lipid standard mixture using
   the reported Matyash method with MTBE/methanol/water (10:3:2.5,
   V/V/V)^[360]106. We analyzed all lipids based on previously reported
   methods^[361]106 with some minor modifications. Here we used an
   LC-MS-based method on a QTRAP 6500 plus mass spectrometer (Sciex, USA)
   coupled with an ultrahigh performance liquid chromatography system
   LC30A (Shimazu, Japan). We employed a BEH HILIC column (2.1 100 mm,
   1.7 μm) and Kinetex C18 column (2.1 100 mm, 2.6 μm), where appropriate,
   with solvent flow rates of 0.5 mL min-1 and 0.3 mL min-1, respectively.
   Mobile phases included H2O/ACN (5:95, V/V) and H2O/ACN (1:1, V/V), both
   containing 10 mM NH4AC for the former, while H2O/MeOH/ACN (1:1:1,
   V/V/V) containing 7 mM NH4OH and IPA containing 7 mM NH4AC for the
   latter. We used a 12-min gradient elution for both columns. The samples
   injection volumes were 2 μl and 1 μl respectively, and we set the
   column temperature at 45 °C.

   We set the MS parameters of MRM acquisition mode for lipid species as
   previously described: collision gas (CAD), medium level; CUR, 40 psi;
   the GS1 and GS2, both 55 psi; the ion spray voltage value was
   relatively set at 5500 V or –4500 V; Turbo V source temperature,
   350 °C; DP, ±70 V. We optimized the collision energy values for these
   subclasses, respectively.

   We used Analyst (v1.7, SCIEX, Chromos, Singapore) and OS (v1.7, SCIEX,
   Chromos, Singapore) to quantify the data for each lipid species. OS
   extracted the peak area and tR from the raw data. We calculated the
   lipid concentration by dividing the peak area of each lipid species by
   the peak area of the corresponding internal standard (IS) and then
   multiplying by the concentration value of the IS.

   We measured the head length of the samples using a spreading caliper.

LRS read mapping

   The sequencing reads were mapped to the human reference genome
   (GRCh38.p13, without alternate sequences) using NGMLR version
   0.2.7^[362]28 with parameters specifically tailored for ONT reads. The
   mapping reads were sorted and indexed using SAMtools version
   1.17^[363]103.

LRS-based SV calling and quality control

   To account for the moderate coverage of our samples, a joint-calling
   strategy was used to call SVs in this study. Specifically, we first
   conducted SV calling per sample using cuteSV version 1.0.13^[364]29
   with parameters recommended for ONT reads by the developers. We merged
   SVs whose breakpoints were within 500 bps across individuals using
   SURVIVOR version 1.0.6^[365]30. Based on the merged SV set, we used
   LRcaller version 1.0^[366]23 to re-genotype each sample and then
   remerged all SVs using the BCFtools (version 1.17) merge
   command^[367]103 to generate a complete SV set. To obtain high-quality
   SV calls, we removed SVs located in centromeric, pericentromeric and
   gap regions in GRCh38.

SV genotyping using SRS data of the 1000 Genomes Project and four archaic
humans

   We extracted the SRS reads around the GSDMD
   (chr8:143,541,891‒143,564,066, GRCh38) and WWP2
   (chr16:69,810,428‒69,834,903, GRCh38) loci from the high-coverage BAM
   files of the samples of the 1000 Genomes Project from
   [368]https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/1000G_2504_high
   _coverage/^[369]46 using SAMtools version 1.15.1^[370]103. We conducted
   SV genotyping using Paragraph version 2.3^[371]47 with default
   parameters.

   The BAM files of four archaic humans were downloaded from
   [372]http://ftp.eva.mpg.de/neandertal/^[373]62–[374]64 and
   [375]http://cdna.eva.mpg.de/denisova/^[376]65. As the sequence read
   data for the four archaic humans were mapped to the GRCh37 human
   reference genome, we converted the breakpoint coordinates identified in
   this study from GRCh38 to GRCh37 using the liftOver tool in the UCSC
   Genome Browser^[377]77 for comparative analyses.

   To examine whether SVs that are fixed (AF = 1) or nearly fixed
   (AF > 0.9) in the Han population also exist with high frequency in
   other global populations. We genotyped all the high-frequency SVs in
   the Han population using Paragraph version 2.3 with default
   parameters^[378]47 based on 2503 samples in the 1000 Genomes
   Project^[379]46 and 50 samples were randomly chosen from an independent
   cohort in our present study. The allele frequency of each SV was
   calculated using BCFtools (version 1.17)^[380]103 with the command of
   “bcftools +fill-tags sv.vcf.gz – -t AF”.

   To explore the SV diversity in the Han in a global context, we
   genotyped 110,863 SVs (after excluding 372 TRAs and 53 SVs containing
   non-ATGC bases in their alternative sequence that cannot be genotyped
   by Paragraph^[381]47) using Paragraph version 2.3 with default
   parameters^[382]47 based on 2503 unrelated high-coverage sequencing
   genomes of the 1000 Genomes Project^[383]46 as well as four archaic
   humans^[384]62–[385]65. In addition, to reduce the potential
   limitations of SRS for accurate SV genotyping, we used LRcaller
   v1.0^[386]23 to genotype the 110,863 SVs in this study across genomes
   from 38 global populations from multiple studies^[387]1,[388]3,[389]20.

   The information of the world map is sourced from the GGV website
   [390]http://www.popgen.uchicago.edu/ggv.

A comparison to the SVs in two chimpanzee genomes and 405 Chinese genomes

   We first downloaded the VCF file named ‘AG18359_ONT.hg38.vcf’ from a
   previous study^[391]48, in which the authors mapped the ~30X ONT reads
   of chimpanzee AG18359 lymphoblastoid cell lines to the human reference
   genome GRCh38 using minimap2^[392]107 (v2.17-r941) and identified
   95,245 SVs (length ≥50 bps) using Sniffles^[393]28 (v1.0.11). After
   converting the VCF to BED format, we intersected these SVs with those
   identified in the chimpanzee genome using BEDTools (v2.30.0)^[394]108
   applying a 50% reciprocal overlap threshold. This analysis revealed
   9,509 SVs of the same type shared between the chimpanzee genome and our
   cohort.

   To further explore whether the SVs are polymorphic across the
   chimpanzees, we analyzed HiFi reads (NCBI accession number:
   SRR21642075, SRR21642086, SRR21755676, SRR21755677, SRR21755678,
   SRR21755679, and SRR21755680) from an additional chimpanzee genome,
   Clint_PTR^[395]109. We aligned the sequencing data to the human
   reference genome GRCh38.p13 using minimap2 v2.26-r1175^[396]107 and
   detected SVs using cuteSV v1.0.13^[397]29. Consistent with the above
   description, we determined the presence of SVs in this chimpanzee
   genome by assessing whether SVs in this study exhibited a reciprocal
   overlap of 50% with those in the chimpanzee genome.

   We also compared the SVs in this study with those in 405 Chinese
   genomes^[398]22 using the same reciprocal overlap criterion of 50%.

   When an SV overlaps with exons of genes or with enhancers of target
   genes, as identified using the GeneHancer database^[399]54, it is
   considered to affect those genes.

PCR-Sanger-based validation

   100 common SVs, including 45 insertions, 45 deletions, four inversions,
   and six duplications, with sizes ranging from 53 to 81,592 bps, were
   initially screened through a manual examination using IGV^[400]33
   alignment with Pacbio HiFi data from the Chinese quartet family, the
   parents and two identical twins
   ([401]http://chinese-quartet.org/)^[402]76. We developed two pairs of
   primers for each SV. If the first primer pair yielded successful
   results, we will not proceed to use the second primer pair. The primers
   were designed using the strategy shown in Supplementary Fig. [403]2 and
   were listed in Supplementary Data [404]2. The PCR validation was
   performed based on the DNAs of the Chinese quartet family, which was
   kindly provided by Dr. Leming Shi at Fudan University. For technical
   validation of SVs, 2 * Phanta Max Master Mix (Dye Plus) (Vazyme,
   #P525-01) was used, with 25 ng of genomic DNA (Homo sapiens) as
   template in a final volume of 25 μl reaction volume. PCR amplification
   cycling included denaturation for two min at 95 °C, followed by 35
   cycles of 10 s 98 °C, 30 s 68 °C, x min 72 °C (30 s/kb).

   All PCR products were subjected to agarose gel electrophoresis. The
   sizes of PCR products were checked based on the Trans 2 K® Plus II DNA
   Marker (Trans, #BM111-01). In addition, the PCR products were further
   confirmed using Sanger sequencing (by Beijing Tsingke Biotech Co.,
   Ltd.). The results of the Sanger sequencing were in Supplementary
   Data [405]8.

   The details for the PCR-Sanger-based validation are provided in the
   Supplementary Note 1.

Estimations of false discovery rate (FDR) and false negative rate (FNR)

   To estimate the boundaries of the FNR and FDR of our dataset, we
   adopted a method from Beyter et al.^[406]23, which is based on
   comparisons with the high-frequency (AF > 0.5) SVs in the global
   populations. Specifically, we used two datasets (Audano et al.^[407]1
   and Human Genome Structural Variation Consortium Phase 2
   (HGSVC2)^[408]3, which respectively characterized SV diversity of 15
   and 35 individuals of diverse ancestry, into the comparisons. We used
   GORpipe version 3.10.1^[409]110 with the parameters “A.gor | join
   -snpseg -f 500 B.gor”, which examines a variant in dataset A that is
   also present (within 500 bps of the start position of the variant
   discovered in dataset A) in dataset B.

Identification of the unreported SVs in the present study

   We examined that whether the SVs in the present study were reported in
   five publicly available SV datasets, including gnomAD^[410]12, Audano
   et al., 2019^[411]1, HGSVC2^[412]3, Wu et al. ^[413]22, and
   dbVar^[414]34 (as of May 19, 2023). We used liftOver in the UCSC Genome
   Browser^[415]77 to convert the coordinates of SVs identified by gnomAD
   from GRCh37 to GRCh38. When a reciprocal overlapping rate between an SV
   in our and published datasets ≥50%, we defined it as reported before.

   The functional impacts of these unreported SVs were predicted based on
   gene coordinates of the RefSeq databases (as of August 17,
   2020)^[416]111 using AnnotSV (version 3.1.1)^[417]49–[418]51. We
   obtained the intersections of unreported SVs with transcription factor
   binding site, and the chromatin state segmentations for cell lines
   GM12878, H1-hESC, HepG2, HMEC, HSMM, HUVEC, K562, NHEK, and NHLF from
   ENCODE project^[419]35 after using liftOver^[420]77 software to convert
   the coordinates of SVs from the GRCh38 version to GRCh37 based on
   ANNOVAR (version 2019Oct24)^[421]36.

Analyses of SV breakpoints and repeat elements

   We downloaded the repeat elements annotation file of GRCh38 from the
   UCSC website^[422]77
   ([423]https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt
   .gz). The intersect module in BEDTools version 2.30.0^[424]108 was used
   to examine whether the SV breakpoints overlapped repeat elements or
   not.

Calculation of the distances between SVs and telomere regions

   The telomere positions on each chromosome were downloaded from the UCSC
   online website^[425]77 using the Table Browser tool^[426]112
   ([427]https://genome.ucsc.edu/cgi-bin/hgTables). The distance between
   SVs and telomeres was defined as the minimum distance between SV
   breakpoints and telomeres on both sides.

A gene-based function annotation and gene ontology enrichment analysis

   A gene-based annotation was performed using AnnotSV (version
   3.1.1)^[428]49–[429]51 with the gene coordinates of the RefSeq
   databases (as of August 17, 2020)^[430]111. We performed gene ontology
   and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment
   analysis using the 2023 version of the DAVID web server^[431]52,[432]53
   ([433]https://david.ncifcrf.gov/). The gene ontology and KEGG pathways
   with false discovery rate (FDR) < 0.1 were considered as significant.

Visualization of epigenomic data for SVs at GSDMD and WWP2 loci

   We used UCSC Genome Browser ([434]https://genome.ucsc.edu/)^[435]77 to
   visualize epigenomic data for SVs at the GSDMD and WWP2 loci. We
   obtained the enhancer regulatory elements annotated by GeneHancer
   database^[436]54, ChIP-seq signals of H3K4me1, H3K27ac, CTCF, and
   transcription factor binding sites from the Roadmap Epigenomics
   project^[437]74 and ENCODE project^[438]35, regulatory elements from
   ORegAnno^[439]113, and Hi-C chromatin interactions reported by Rao et
   al.^[440]114. The Hi-C interactions were documented using the ‘Interact
   Track Format’
   ([441]https://genome.ucsc.edu/goldenPath/help/interact.html). To
   visualize the Hi-C interactions, we first used the Table Bowser tool
   ([442]https://genome.ucsc.edu/cgi-bin/hgTables) ^[443]112 in the UCSC
   genome browser to download all the Hi-C interactions in the regions
   (WWP2 locus: chr16:69,685,000‒70,025,000, GSDMD locus:
   chr8:144,620,000‒144,655,000, GRCh37). We then used the intersect
   module of BEDTools version 2.30.0^[444]108 to obtain the Hi-C
   interactions between the SV region and the regions located within 1 kb
   upstream of the transcription start sites of WWP2 and GSDMD. We
   uploaded the remaining interactions to the UCSC Genome Browser website
   ([445]https://genome.ucsc.edu/)^[446]77 for visualization.

Identification of the common SVs located in the annotated enhancer regions in
the human reference genome

   Using the intersect module of BEDTools toolkit version 2.30.0^[447]108,
   we searched the common SVs (with MAF ≥ 5%) in our dataset that
   completely overlapped the enhancer annotations from ENCODE
   project^[448]35, Roadmap Epigenomics project^[449]74, and GeneHancer
   database^[450]54.

Exploring the functional significance of genes regulated by common SVs in
mice

   We explored the potential functions of common SV-regulated genes based
   on the MGI database^[451]66. We first mapped the genes regulated by SVs
   to homologous genes in mice based on the HMD_HumanPhenotype.rpt file
   downloaded from
   [452]https://www.informatics.jax.org/downloads/reports/index.html#pheno
   . After excluding genes that were lethal or genes that had no obvious
   phenotypic effects in mouse KO models, we then classified the
   phenotypic changes in the remaining genes based on the
   MGI_GenePheno.rpt file downloaded from
   [453]https://www.informatics.jax.org/downloads/reports/index.html#pheno
   .

Local assembly of the SV at the WWP2 locus

   We extracted LRS reads spanning over the breakpoints from the bam files
   of the carriers of the SV in our cohort, and used Flye version
   2.9.2^[454]115 to perform local assembly. The assembled contig was
   aligned to the reference genome using the online BLASTN
   ([455]https://blast.ncbi.nlm.nih.gov/Blast.cgi) based on which the
   dotplot in Fig. [456]7c was generated.

A SINE-mediated rearrangement in WWP2 causes false positive detection of
4.5 kb deletion

   We utilized Manta version 1.6.0^[457]116 and Delly version
   1.1.8^[458]117 to detect SV around the WWP2 locus based on
   high-coverage BAM files of the samples of the 1000 Genomes Project from
   [459]https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/1000G_2504_high
   _coverage/^[460]46. In total, 138 samples were used (sample IDs are
   listed in Supplementary Data [461]9). We set the “–generateEvidenceBam”
   parameter for Manta and “-d” parameter for Delly to report the reads
   that support the detection of the SV. Both Manta and Delly reported
   a ~4.5 kb deletion (Manta: chr16:69,820,428‒69,824,903, Delly:
   chr16:69,820,426‒69,824,899). We found that the reads that support this
   deletion originated from the fused SINE element region using
   Integrative Genomics Viewer version 2.11.1^[462]118. We also observed
   a ~4.5 kb deletion was reported by cuteSV in our cohort.

Alignments of the human and chimpanzee reference genomes around the SVs at
the WWP2 and GSDMD loci

   We obtained homologous regions between the reference genomes of human
   (GRCh38) and chimpanzee (PanTro6, January 2018 assembly) at the WWP2
   and GSDMD loci using Lift Genome Annotations^[463]77. In addition to
   the sequence within each SV, we also extracted 2000 bps upstream and
   downstream of the SV breakpoints. We then aligned the sequences using
   the NCBI online BLASTN tool
   ([464]https://blast.ncbi.nlm.nih.gov/Blast.cgi).

Positive selection scan of the SVs in global populations

   We downloaded the SV calls in a previous study of SV diversity in
   global population using LRS data^[465]3 from
   [466]https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2
   /release/v2.0/integrated_callset/variants_freeze4_sv_insdel_alt.vcf.gz.
   We extracted a total of 102,118 insertions and deletions located in
   autosomes using BCFtools (version 1.17)^[467]103 view command. We
   merged the insertion call at the WWP2 locus with this dataset, since
   the deletion at the GSDMD locus is reported.

   To test whether the SVs at WWP2 and GSDMD were under positive selection
   in global populations or not, we randomly selected 50 samples per
   population in the 1000 Genomes Project and downloaded their
   high-coverage bam files from
   [468]https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/1000G_2504_high
   _coverage/^[469]46 (Supplementary Data [470]10). Additionally, 50
   samples were randomly chosen from the independent cohort in our present
   study. We genotyped all the SVs in the merged dataset using Paragraph
   version 2.3 with default parameters^[471]47. The F[ST] were calculated
   between the African populations and each of the non-African populations
   using VCFtools version 0.1.16^[472]119 with the following parameters
   “–weir-fst-pop African_population.txt –weir-fst-pop
   non_African_population.txt”. We then analyzed the percentile
   distribution of F[ST] for SVs at WWP2 and GSDMD by population.

PCR and Sanger sequencing of the SVs at the WWP2 and GSDMD loci

   One sample carrying heterozygous SV at WWP2 locus and one sample
   carrying heterozygous deletion at GSDMD locus were used to validate the
   SVs at the WWP2 and GSDMD loci. PCR primers were designed outside of
   the SV region. Therefore, we can get wild-type and SV-affected
   sequences for the downstream luciferase reporter assays. Specifically,
   primers with homology arms of pGL3-Promoter vector were designed using
   Primer-BLAST
   ([473]https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi) and
   were synthesized by TsingKe Biotech Corp (Shanghai, China). The
   homology arms were designed to have sequence complementarity with
   regions adjacent to the XhoI restriction site on both the target vector
   and the DNA fragment to be inserted. The empty pGL3-Promoter vector was
   kindly provided by Dr. Xiaoyang Zhang at Fudan University. Considering
   the sequences around the WWP2 locus are highly repetitive and contain
   SINE elements, the target region was amplified using touchdown PCR to
   improve amplification efficiency and reduce non-specific amplification.
   Specifically, the PCR amplification cycling consisted of an initial
   denaturation step at 95 °C for 3 min, followed by 30 cycles of the
   following steps: denaturation at 95 °C for 15 s, annealing at
   temperatures ranging from 62 °C to 55 °C for 15 s, decreasing the
   annealing temperature by 1 °C per cycle for the first few cycles, and
   finally, extension at 72 °C for x minutes (1–2 kb/min). 2* Phanta Max
   Master Mix (Dye Plus) (Vazyme, #P525-01) was used, with 50 ng of
   genomic DNA (Homo sapiens) as the template in a final reaction volume
   of 50 μl. All PCR products were subjected to agarose gel
   electrophoresis. Trans2K® Plus II DNA Marker (Trans, #BM111-01) was
   used as molecular weight standards for bands of PCR products.

   We performed Sanger sequencing based on the PCR products at the WWP2
   and GSDMD loci. The sequence at the WWP2 locus is:

   ATGGGTAGCGACGGCAGTTGATCTGAACTCAGGATCACTCGTGTTACACTGCAATCGCGTGTCGCCCTTTC
   AGAGACCGCTAAAGACAGCCTGAAATCCCAAAGCCTGGCATCTGGTGCCAATCAAAAAAACAGTAGTCGCT
   GATGTAACTGCAACTCGATCAGGGCAAAATGAAACAACGGCTTGCATTTGGCGGCTGGAACTGGGCTCAGC
   CCATAGCCTCGCCGGAGTGGGCAGCACGCACCACACTGGTCCTGCTGTCAGCCTCCGAGGTGAGCCGGACA
   CTGGCTGCAAAGACCTAAGTCACAGGAAAAGACGTTCATCCTCAGTAATTAGAGGAGAAAGTCACTCTTTC
   AACGTCGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATGCATGTATTCATGGTTTTGTGTTGTAGGTACAC
   AAAAGTTGTTCACTTTTGTGTATGAAGTCATTGCAAACACAAGCACAAATATACACATAAAAGCTACAGAT
   GCAAGGCCGAGCACAGTGGCTCATGCCTGTAATCCCAGAGCCTGAGGTGGGTGGATCACCTGAGCTCAGGA
   GTTCGAGACCAGCCTGGCCAACAAAGTGAAACCCCATCTCTACCAAAAATACAAAAATTAGTCAGGCATGG
   TTGTGCATGCCTGTAATCCCAACTACTAAAAATACAAAAAAATTAGCCAAGCACAGTGGCACATGCCTTTA
   ATCCCAGCTACTCAGGAGGCGGAGGCAGGAGAATCACTTGAACCCAGGAGGGCGGAGGTTGCAGTGAGCTG
   AGATGCACCACTGCACTCCAGCCTGGGCAACAGAGTGAGACTCCATCTCAAAAAAATAATAATAACAAAAA
   TAAAGGAGACAGGGGTCCCACTCTGTCACCAGGCTGGAGTACAGTGGCATGATCACAGTCACTGGATCTCG
   ACACCCAA (strand: minus)

   The sequence at the GSDMD locus is:

   GGACTTAATGTAGTGGGCCTCTGTGTTCCTGTGTGTGCACATGTCGCTGTGTGTGTTCATGTGCAGGTCAC
   GGTGCGCAGCCGGTGCCACGCCCGTGGCTGTCCGCTGCAGGAAGGGGAGCCCATCTGGGCCTCCTCCCTCC
   TCCCCTCCTCCCGCTTTTCTCCTTTCTTTTGCAGCAAAATTCCTGGAAAGCCCTGTACACTGCACTCCAGC
   CTGGGTGACAGAGCAAGAATCCATCTCTAAAATAAAAATATACTTTCTCCCAAGGTCCCCAGCCCAGAGGG
   TGTGAGGGCCTTGCAGTGGGAGGTAGGTGTCACTGCGCATCCGTGACAGTGGGGAGAGTGGGATGAGGGGG
   ACCCACCAGACCTCTAGAGCAGTTTTCTCCCACTGTCACTTTCTCCCTCCATAAAAGGGG (strand:
   minus)

Quantitative real-time reverse transcription (RT-qPCR)

   We examined the impacts of the SVs on the expressions of GSDMD and WWP2
   based on RNAs of a Chinese quartet family, the parents and two
   identical twins ([474]http://chinese-quartet.org/)^[475]76. Total RNAs
   were extracted from whole blood. cDNAs were reversed-transcribed using
   a HiScript® III RT SuperMix for qPCR (+gDNA wiper) kit (Vazyme,
   #R323-01), followed by amplification of cDNA using ChamQ Universal SYBR
   qPCR Master Mix (Vazyme, #Q711-02). The relative mRNA levels of genes
   were quantified using the 2^-ΔΔCt method, with normalization to GAPDH.
   Primers are listed in Supplementary Data [476]11.

Recombinant plasmids construction

   DNA fragments were prepared by PCR amplification or synthesized by
   Tsingke Biotech Corp (Shanghai, China). The recombinant plasmids were
   constructed by homologous recombination using ClonExpress II One Step
   Cloning kit (Vazyme, #C112) according to the manufacturer’s
   instructions. The reaction mixtures were then transformed into F-DH5α
   (WeiDi, #WD-DF1001M) using heat shock. Transformed cells were selected
   on agar plates containing ampicillin. Positive colonies were screened
   and sequenced in both directions by Sanger sequencing to ensure
   correctness. Plasmids from the positive colonies were isolated using
   TIANprep Mini Plasmid Kit (TIANGEN, #DP103).

Cell culture, transient transfections and luciferase reporter assays

   The HeLa cell line (CCL-2) was obtained from the American Type Culture
   Collection (ATCC) and cultured in Dulbecco’s modified Eagle’s medium
   (DMEM) (BasalMedia, D211114), which was supplemented with 10% fetal
   bovine serum (FBS) (Life-iLab, #AC03L055) and 1%
   penicillin/streptomycin (Life-iLab, #AC03L332). ATCC is a trusted and
   widely recognized source of authenticated cell lines. Thus, additional
   authentication was not performed in our laboratory. HeLa cells were
   maintained at 37 °C with 5% CO[2] and were plated on 12-well dishes
   24 h prior to transfection and grown to approximately 50% confluency.
   The cells were transfected with 1 μg of enhancer plasmids and 0.1 μg of
   Renilla plasmid as an internal control of transfection efficiency using
   EZ Trans RNA transfection reagent (Life-iLab, #AC04L051) according to
   the manufacturer’s instructions. Empty pGL3-Promoter vector was
   co-transfected with Renilla as control. Cells were harvested 48 hours
   after transfection. The cell pellet was fully lysed in reporter lysis
   buffer (Beyotime, #RG027) and the supernatant was acquired after
   centrifuging at 10,000 g for 5 min. The luciferase activity was
   measured with the dual luciferase reporter gene assay kit (Beyotime,
   #RG027) using Synergy™ 2 Multi-Mode Microplate Reader (BioTek). Each
   assay was performed in triplicate. The relative luciferase activity was
   determined by dividing the Relative Light Units (RLU) value of Firefly
   luciferase by the RLU value of Renilla luciferase based on the
   manufacturer’s prescribed method.

Mice experiments

   Mice (strain: C57/Bl6) were raised and maintained in a barrier
   facility. All animal experiments were reviewed and approved by the
   Institutional Animal Care and Use Committee of the East China Normal
   University, and conducted according to institutional guidelines. The
   mice were housed in a Specific Pathogen-Free (SPF)-grade animal
   facility under controlled environmental conditions. The ambient
   temperature was maintained between 20 °C and 27 °C with a temperature
   variation of less than 2 °C, and the relative humidity was controlled
   at 40–70%. A 12-h light/dark cycle (12 h light/12 h dark) was
   implemented to ensure a stable circadian rhythm. The illumination in
   the housing area was kept between 15 and 30 lux to minimize stress on
   the animals. Sterile-grade bedding and feed specifically designed for
   laboratory animals were used, and routine maintenance was conducted to
   ensure a clean environment.

   Corresponding littermates were used as WT controls in all of the
   experiments performed with KO mice. We identified the homologous
   regions in the mice genome using the liftOver tool in the UCSC Genome
   Browser website^[477]77. Wwp2 and Gsdmd mutant mice were generated by
   co-injection of Cas9 mRNA (100 ng/μl; ThermoFisher, #A29378) and sgRNA
   (50 ng/μl) at the Mouse Core of East China Normal University. sgRNAs
   were generated using the Guide-it™ sgRNA In Vitro Transcription Kit
   (Takara #632635), with their sequences listed in Supplementary
   Data [478]11. Genomic DNA was extracted from newborn F0 mice’s toes or
   tails one week after birth for sequencing, with genotyping primers
   listed in Supplementary Data [479]11. For the cisplatin-induced injury
   model, 8-week-old gender-matched wild-type and Gsdmd enhancer
   homozygous KO mice received intraperitoneal one-time injections of
   cisplatin (25 mg/kg) (Cayman Chemical, #13119) and were euthanized
   three days later.

microCT analysis

   For micro-CT analysis, the proximal femur and skull of 7-week-old mice
   were scanned using a Micro-CT scanner (Skyscan 1272, USA Bruker) with
   an X-ray energy of 450 μA/50 kV and an isometric resolution of 9 μm.
   The trabecular bones were assessed for BMD, BV/TV ratio, trabecular
   number, trabecular thickness, and trabecular separation/spacing using
   the CT Evaluation Program.

Histopathology analysis

   Kidneys were harvested from mice, rinsed in PBS, fixed in 10% formalin,
   and then embedded in paraffin. Paraffin-embedded sections were stained
   with hematoxylin and eosin (H&E) (Servicebio, #G1005) and terminal
   deoxynucleotidyl transferase dUTP nick-end labeling (TUNEL)
   (Servicebio, #G1507) to analyze the histology of samples.

Statistical analysis

   Statistical analyses were performed using GraphPad Prism software
   version 8.0.2 (GraphPad Software Inc., La Jolla, CA) and R version
   4.1.0. P values less than 0.05 were considered statistically
   significant.

Reporting summary

   Further information on research design is available in the [480]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [481]Supplementary Information^ (2.9MB, pdf)
   [482]41467_2025_56661_MOESM2_ESM.pdf^ (99.2KB, pdf)

   Description of Additional Supplementary Files
   [483]Supplementary Data 1^ (46.4KB, xlsx)
   [484]Supplementary Data 2^ (20KB, xlsx)
   [485]Supplementary Data 3^ (14.3KB, xlsx)
   [486]Supplementary Data 4^ (18.9KB, xlsx)
   [487]Supplementary Data 5^ (11.3KB, xlsx)
   [488]Supplementary Data 6^ (11.1KB, xlsx)
   [489]Supplementary Data 7^ (10.5KB, xlsx)
   [490]Supplementary Data 8^ (37MB, zip)
   [491]Supplementary Data 9^ (11.1KB, xlsx)
   [492]Supplementary Data 10^ (32.5KB, xlsx)
   [493]Supplementary Data 11^ (12.1KB, xlsx)
   [494]Reporting Summary^ (3.9MB, pdf)
   [495]Transparent Peer Review file^ (4.7MB, pdf)

Acknowledgements