Abstract

   With ongoing improvements in the detection of complex genomic and
   epigenomic variations, long-read sequencing (LRS) technologies could
   serve as a unified platform for clinical genetic testing, particularly
   in rare disease settings, where nearly half of patients remain
   undiagnosed using existing technologies. Here, we report a simplified
   funnel-down filtration strategy aimed at enhancing the identification
   of small and large deleterious variants as well as abnormal
   episignature disease profiles from whole-genome LRS data. This approach
   detected all pathogenic single nucleotide, structural, and methylation
   variants in a positive control set (N = 76) including an independent
   sample set with known methylation profiles (N = 57). When applied to
   patients who previously had negative short-read testing (N = 51),
   additional diagnoses were uncovered in 10% of cases, including a
   methylation profile at the spinal muscular atrophy locus utilized for
   diagnosing this life-threatening, yet treatable, condition. Our study
   illustrates the utility of LRS in clinical genetic testing and the
   discovery of novel disease variation.

   Subject terms: Medical genomics, Genetic variation
     __________________________________________________________________

   Here the authors illustrate how long read whole genome sequencing can
   enhance the diagnosis of patients with rare diseases relative to
   standard methods, through identification of comprehensive small and
   large genomic variation on both alleles as well as methylation
   profiling.

Introduction

   Around 7000 rare diseases have been identified, collectively imposing
   significant health and socio-economic burden^[56]1. The majority of
   these diseases have a genetic origin due to variants ranging from
   single nucleotide variants (SNVs) or a few nucleotide
   insertions/deletions (INDELs) to large genomic changes such as copy
   number variants (CNVs), translocations, inversions, transposable
   element (TE) insertions, or complex rearrangements. Some are also
   associated with specific epigenomic profiles^[57]2. This diverse
   spectrum of disease-causing changes, often detected by different
   technologies, has challenged current genetic diagnostic strategies and
   contributed to long diagnostic odysseys, averaging at 6 years^[58]3,
   delaying timely management or treatment plans for patients with rare
   diseases.

   Although short-read sequencing technologies have brought a remarkable
   leap in the diagnosis of rare genetic diseases^[59]4,[60]5, more than
   half of the patients remain undiagnosed. This is partly due to the
   inherent limitations of this technology in detecting complex variants
   such as structural variants, methylation profiles, repeat expansions,
   or variants embedded in inaccessible regions of the genome,
   specifically high homology and GC-rich regions^[61]6. Recent advances
   in third-generation sequencing technologies have demonstrated the
   application of targeted LRS for identifying pathogenic variants in
   known or novel disease-causing genes^[62]7–[63]10. However, the
   clinical implementation of LRS for detecting genome-wide variation and
   methylation changes in the context of rare diseases has been limited by
   challenges associated with the annotation and filtration of a large
   number of variants and is, therefore, yet to be explored. Here, we
   optimize a whole genome LRS workflow and a filtration strategy which we
   apply in a cohort of undiagnosed patients with suspected rare diseases
   leading to additional diagnoses and the uncovering of a methylation
   signature utilized for the diagnosis of Spinal Muscular Atrophy (SMA).

Results and discussuion

   We optimized our analysis workflow on a selected cohort of 17 patients
   with confirmed genetic diagnoses, encompassing a diverse array of
   genomic and epigenomic pathogenic variants (Fig. [64]1a and
   Supplementary Fig. [65]1a). The study design incorporated wet bench
   protocol optimized for long-read Oxford Nanopore sequencing using
   PromethION system targeting a minimum of 30X coverage with average N50
   of 12 kb (Fig. [66]1a and Supplementary Fig. [67]1b). Our computational
   analysis workflow consists of a “genome” and “epigenome” modules
   (Fig. [68]1a and Supplementary Methods). The former module consists of
   detection, annotation, and selection of short variants (SNVs and
   INDELs) (below) and genome-wide rearrangements, mainly copy number
   variations (CNVs) and structural variations (SVs). Raw variants were
   retained if calls were supported by ≥ 5 reads with allele fraction ≥0.3
   and were affecting the coding region of genes associated with disease
   as defined in OMIM or GeneCC (Supplementary Methods). This reduced the
   number of called variants by 58.8% for CNVs and 99.2% for SVs. Further
   filtering of variants unique to each patient in the cohort reduced CNVs
   by 97.3% (average n = 3) and SVs to 99.9% (average n = 46) (Fig. [69]1b
   and Supplementary Fig. [70]1c), which were then manually inspected for
   any clinical correlation. This led to the detection of all associated
   pathogenic variants in this group (Supplementary Fig. [71]1d–fand
   Supplementary Fig. [72]2a, b).

Fig. 1. Study design, proof of concept in positive cohort and overview of
negative cohort.

   [73]Fig. 1
   [74]Open in a new tab

   Shown are (a) the study design schema, along with the wet bench and
   bioinformatics workflows with the tools used. Information about the
   positive (N = 17) and negative (N = 51) samples are in Supplementary
   Fig. [75]1a and Supplementary Data [76]2. Samples used for optimization
   (N = 59) include two samples with SMN1 carrier status and 57 samples
   set with previously confirmed methylation profiles, as explained in the
   main text. b counts of CNVs and SVs for each method in each filtering
   step of the “funnel-down” approach. c aggregate methylation profile of
   SMA –heatmap of base methylation modification (%) within the SMN1
   chr5:70239954-70249165 region (left panel) and IGV methylation view in
   SMN1 (right panel) for SMA positive (OXN-007, OXN-008, OXN-010,
   OXN-068, OXN-069), SMA carrier (OXN-070, OXN-071) and SMA negative
   samples (OXN-012, OXN-021). d the gender and demography and (e) the
   most prevalent primary clinical symptom in the “Negative Cohort”.
   Source data are provided as a Source Data file.

   The epigenomic module “Epimarker” scans episignatures specific to 36
   Mendelian neurodevelopment disorders (MNDD)^[77]2, including Angelman
   syndrome and Spinal Muscular Atrophy (SMA) (see below). This module
   correctly detected loss of methylation at 15q11.2 in a patient with
   Angelman syndrome (OXN-18), while SNP analysis revealed substantial
   loss of heterozygosity (LOH) across chromosome 15, suggesting paternal
   uniparental disomy (pUPD) as the underlying mechanism of disease in
   this patient (Supplementary Fig. [78]1e, f). We also optimized
   Epimarker using an independent LRS dataset (N = 57) comprising 9 MNDD
   in a cohort of 17 patients, with clinically confirmed abnormal
   methylation profiles, along with 40 controls^[79]10. Epimarker
   classified all patients with 100% sensitivity, while none of the
   control samples were assigned to MNDD (100% specificity) (Supplementary
   Data [80]1 and Supplementary Methods). We further observed a
   methylation profile across the SMN1 gene, whose biallelic loss causes
   spinal muscular atrophy (SMA), and its homologous pseudogene, SMN2, and
   explored the utilization of this methylation pattern as a diagnostic
   marker for the disease (Fig. [81]1c) We observed 0–15% (low), 50–70%
   (moderate) and 98–100% (high) of bases with methylation modification
   for SMA patients (N = 5), carriers (N = 2) and non-carriers (N = 3),
   respectively, in the locus spanning intron 6, exon 7 and intron 8
   of the SMN1 gene (chr5:70239954-70249165) (Fig. [82]1c and
   Supplementary Fig. [83]2a). SMA is a common, life-threatening autosomal
   recessive neuromuscular disease mostly caused by biallelic deletion of
   exon 7 in SMN1^[84]11. We investigated this finding by deconvoluting
   reads in SMN1 and SMN2 based on 16 paralog-specific variants
   (PSVs)^[85]12,[86]13 (Supplementary Methods and Supplementary
   Fig [87]2b). Upon deconvolution, we confirmed the biallelic loss of
   SMN1 in SMA patients, where no specific reads were mapped to this gene;
   this finding was further confirmed by droplet digital PCR
   (Supplementary Fig. [88]2c). Notably, SMA carriers had a reduced number
   of reads in SMN1 relative to non-carriers (Supplementary
   Fig. [89]2b–d). Taken together, we propose a workflow where methylation
   spanning SMN1 introns 6 and 8 (including exon 7) can be used as a “tag”
   for SMA diagnosis and carrier status determination, which can then be
   confirmed upon LRS read deconvolution using SMN1 and SMN2 PSVs at this
   locus (Supplementary Fig. [90]2b). Overall, our pipeline was able to
   correctly identify all the pathogenic variants, including complex
   rearrangements and aberrant methylation, in the optimization cohort.

   We applied this workflow to a set of undiagnosed patients (N = 51), who
   previously had inconclusive testing using short read whole exome
   sequencing (WES). Among them, 41% had undertaken multiple genetic
   testing, of which 86% had received chromosomal microarray (CMA) testing
   (Fig. [91]1a and Supplementary Data [92]2). Patients were mostly of
   Arab descendant (90%), had overall equal gender representation (~ 44%
   females), and primarily presented with neurological disorders (45%)
   (Fig. [93]1d, e and Supplementary Data [94]2). Whole genome LRS in this
   cohort obtained an average of 49X coverage and N50 of 11.7Kb
   (Supplementary Fig. [95]1b). Since all the samples were previously
   tested by WES, we focused on SNVs within exons that were missed by WES
   and those in the 50 bp exon-intron boundary by assessing their splicing
   potential (Supplementary Data [96]3). We detected ~ 47,000 LRS-specific
   exonic and ~ 41,000 splicing SNVs comprising 1.8% of the total detected
   SNVs for each patient (Supplementary Data [97]3). We applied our SNV
   filtration criteria (Supplementary Fig. [98]3a) resulting in
   approximately ~ 2 LRS-specific exonic and ~ 5 splicing SNVs for manual
   inspection in accordance with ACMG guidelines^[99]14 (Supplementary
   Methods and Supplementary Fig. [100]3b). We then evaluated variants
   within genes associated with diseases matching patients’ phenotypes and
   identified a single variant in DNMT1 ([101]NM_001130823:
   c.891 + 8 C > T) in OXN-044 with a highly predicted splicing impact
   (SpliceAI score = 0.93). Indeed, transcriptomic sequencing using RNA
   extracted from peripheral blood in this patient confirmed a splicing
   defect whereby the c.891 + 8 C > T variant introduced a cryptic donor
   splice site leading to intronic retention of six nucleotides
   (Supplementary Fig. [102]3cand Supplementary Data [103]4, [104]5 and
   [105]6). However, this change introduced two in-frame amino acids and
   is therefore unlikely to affect protein function as corroborated by the
   high allele frequency of the c.891 + 8 C > T variant in the general
   population and specifically in the Middle East (6.5% allele frequency
   with 25 homozygotes in gnomAD v4.1.0). Therefore, this variant was
   classified as clinically benign. No other putative clinically relevant
   sequence variants were identified.

   We next focused on larger genomic rearrangements and detected ~ 35,000
   SVs and ~ 83 CNVs in each sample (Supplementary Data [106]3). These
   were substantially reduced by 98.5% and 99.9%, respectively, after
   applying our filtering and selection criteria (Fig. [107]2a,
   Supplementary Fig. [108]3a and Supplementary Data [109]3). Within
   filtered large CNV events, we identified pathogenic variants in two
   patients. For patient OXN-033, two deletions from a total of 59 CNVs
   were prioritized, of which a heterozygous deletion event (1.4 Mb) at
   2q11.1-q11.2 was classified as pathogenic post manual inspection, was
   validated by CMA and found to be de novo upon parental testing
   (Fig. [110]2b and Supplementary Data [111]7). Individuals with 2q11.2
   deletions have developmental delay, intellectual disability, dysmorphic
   features, and variable skeletal anomalies along with obesity^[112]15
   which was consistent with this patient’s phenotype. The other
   prioritized heterozygous deletion in this patient was 1.4 Mb in size at
   15q13.1-q13.2 as confirmed by CMA. The only known disease-causing gene
   in this region is NSMCE3 which has been associated with autosomal
   recessive immunodeficiency and lung disease (MIM# 617241). Therefore,
   this deletion, confirmed to be paternally inherited, was not considered
   to be diagnostic in this patient. In another patient (OXN-048), with
   unconfirmed diagnosis of anterior segment dysgenesis and a heterozygous
   pathogenic variant in the SLC38A8 gene identified by exome sequencing,
   we detected a single heterozygous deletion (80 kb) at 16q23.3
   (Fig. [113]2c and Supplementary Data [114]7), partially encompassing
   SLC38A8 (exons 8 – 3’UTR), using LRS. SLC38A8 is associated with
   autosomal recessive foveal hypoplasia and/or anterior segment
   dysgenesis matching the patient’s phenotype^[115]16. Taking advantage
   of the long reads, we phased the two variants and observed that each
   variant is in a distinct haplotype confirming the compound heterozygous
   configuration in this individual and the biallelic impairment of
   SLC38A8 (Fig. [116]2c).

Fig. 2. Detected pathogenic variants in patients, with previously negative
testing, leading to confirmed diagnoses.

   [117]Fig. 2
   [118]Open in a new tab

   a Reported on the right side of the graph are the detected numbers of
   genomic variants (CNVs and SVs), and on the left side, the high
   confidence variants, post funnel down filtering where samples with
   confirmed diagnoses, due to CNVs (white line graph) or SVs (gray bars),
   highlighted in orange. b deletion event at 2q11.1-q11.2 in OXN-033
   identified by LRS (shown in IGV as a decrease in coverage, in green, in
   the top panel) and validated by CMA (bottom panel showing log[2] ratios
   or B-allele frequency (0, 0.5, or 1), in purple, of copy number or SNP
   probes, respectively. Red bar marks the deletion). c Phased genomic
   alignment with allele-specific INDEL and large deletion in SLC38A8 in
   OXN-048 (left top panel) with a zoomed view of INDEL (bottom panel) in
   IGV. Coverage shown as green dots, and red bars represent intragenic
   deletion (top) or INDEL (bottom). CMA profile (right top panel showing
   log[2] ratios or B-allele frequency (0, 0.5, or 1), in orange, of copy
   number or SNP probes, respectively. Red bar marks the deletion) and PCR
   gel electrophoresis (bottom panel; refer to Methods) corroborating the
   finding. d homozygous deletion in the 3’ UTR of MPLKIP detected by LRS
   (IGV alignment, top panel), validated by PCR (bottom left panel; refer
   to Methods) with significant difference in the normalized gene
   expression (bottom right panel) between Control (n = 6; two independent
   biological samples each repeated 3 times) and OXN-027 (n = 3; same
   sample repeated 3 times) as determined by transcriptomic sequencing
   (refer to Methods). Box plots lower and upper bounds are 25% and 75%
   percentile, respectively; the line represents the median, and the
   whiskers show the minima and maxima of data points. P-value was
   calculated using the Wilcoxon two-tailed test. e left, heatmap with
   methylation profile across negative cohort (including OXN-62) and a
   published HMA control sample (left panel); right, duplication event at
   5q35.2-q35.3 in OXN-062 detected by LRS (IGV coverage view in green,
   top right panel) and validated by CMA (bottom right panel showing
   log[2]ratios or B-allele frequency (0, 0.5, or 1), in blue, of copy
   number or SNP probes, respectively. Blue bar marks the duplication). f
   heatmap of base methylation modification (%) within the
   chr5:70239954-70249165 region across the negative cohort, SMA positives
   (OXN-007, OXN-008, OXN-010, OXN-068, OXN-069) and carriers (OXN-070,
   OXN-071) (left panel). IGV methylation view post-read deconvolution
   (center panel) and copy numbers (bar graphs, right panel) of SMN2 (dark
   green) and SMN1 (light green) detected by ddPCR. Source data are
   provided as a Source Data File. All PCR gels are performed with a
   1500 bp ladder, and units are in base pairs (refer to “Methods”).

   We then examined the landscape of structural variants. We identified a
   homozygous deletion of 3.6 kb detected by both Sniffles and CuteSV,
   partially including the 3’ untranslated region (UTR) of the gene
   encoding the M-Phase Specific PLK1 Interacting Protein (MPLKIP) in
   patient OXN-027 (Fig. [119]2d and Supplementary Data [120]7). This
   patient showed signs of learning disabilities with distinctive brittle
   hair, a hallmark of Trichothiodystrophy nonphotosensitive 1 associated
   with non-functional MPLKIP protein. The 3’UTR region is known to
   regulate mRNA-based processes^[121]17, hence we hypothesized that the
   homozygous 3’UTR deletion of the MPLKIP gene could alter its expression
   levels. In fact, transcriptomic analysis, using RNA extracted from
   peripheral blood in this patient, showed that this gene is
   significantly overexpressed (Fig. [122]2d) in this patient suggesting
   that its dysregulation might underlie the observed phenotype,
   especially in light of the MPLKIP protein role as a cell cycle and
   mitosis regulator where its function might be dependent on its
   expression levels. However, further investigation is required to
   confirm this mechanism and to understand the functional impact of the
   3’UTR deletion in this gene.

   We next used “Epimarker” to compare the methylation patterns of the 51
   undiagnosed patients with the episignature profiles associated with 34
   MNDD^[123]2. One patient (OXN-062) was classified, based on its
   methylation profile, as having Hunter McAlpine syndrome (HMA) with a
   duplication at 5q35.2-q35.3 containing the NSD1 gene detected by LRS
   CNV analysis and validated by chromosomal microarrays (Fig. [124]2e and
   Supplementary Data [125]7). HMA is characterized by craniosynostosis,
   intellectual deficit, short stature, and facial dysmorphism matching
   the clinical indication of the patient. While deletions of NSD1 and
   hypomethylation at this locus are associated with Sotos syndrome, HMA
   has been associated with micro-duplication involving NSD1 and a
   hypermethylation profile^[126]2 confirming the diagnosis for this
   patient. We then examined the SMA methylation tag described above
   across all undiagnosed patients. Interestingly, we observed a
   methylation profile consistent with biallelic loss of SMN1 in the
   patient (OXN-060). Reads deconvolution analysis as well as droplet
   digital PCR confirmed SMA diagnosis in this patient who also had 4
   copies of the SMN2 gene (Fig. [127]2f).

   The protocols for analyzing LRS are still in nascent stages, and no
   global standard methods have been established. In this study, we aimed
   to establish a comprehensive diagnostic workflow for LRS to investigate
   the genomic and epigenomic landscape in rare disorders focusing on
   variants disrupting disease-causing genes or loci. We propose a
   filtering strategy which substantially reduces the number of variants
   detected by whole genome LRS while capturing a wide spectrum of genomic
   and epigenomic pathogenic variation, leading to 10% (5 out 51)
   additional diagnoses in patients with rare diseases who had
   inconclusive testing using traditional methods. We acknowledge that our
   approach might have reduced sensitivity for several reasons, including
   the possibility of filtering out large SV events in noncoding regions,
   which might still be causative through several mechanisms such as
   positional effects or the disruption of topologically associating
   domains (TADs). However, such events would still require functional
   validations which might not yet be part of routine clinical testing. In
   two patients, for example, transcriptomic analysis was needed to
   confirm an RNA splicing effect (OXN-044) or to assess an expression
   outlier (OXN-027). We also developed an LRS-based “Epimarker” method to
   empirically profile patients for episignatures of 36 diseases in the
   clinical setting. We also uncover, for the first time, an SMA-specific
   methylation tag which was incorporated into our clinical “Epimarker”
   profiling. Taken together, our results demonstrate the potential of
   long-read sequencing as a single unified assay for routine clinical
   genetic testing and the discovery of novel rare disease variations.

Methods

Patient samples

   This study was reviewed and approved by the Dubai Scientific Research
   Ethics Committee, Dubai Health Authority (approvals no.
   DSREC-SR-03/2023_08 and DSREC-08/2024_09). Control DNA samples
   (N = 17), with known genomic or methylation aberrations (Supplementary
   Fig. [128]1a), were used for optimizing the library preparation,
   sequencing, bioinformatics analysis, and clinical annotation and
   filtration. The clinical utility of our approach was then evaluated on
   DNA from 51 patients with highly suspected monogenic disorders, and
   non-diagnostic short-read whole exome sequencing. 41% of those patients
   (N = 21) had undertaken multiple genetic testing, of which 86% (N = 18)
   had received chromosomal microarray (CMA) testing. All patients were
   consented for clinical genetic testing under an approved de-identified
   research protocol, which permits the publication of de-identified
   analyses.

Long read WGS library preparation and sequencing

   Genomic DNA was extracted from peripheral whole blood using the
   QIAsymphony DSP DNA Kit (Qiagen, Hilden, Germany) and QIAsymphony
   automated nucleic acid extraction instrument, according to the
   manufacturer’s instructions. 6000 ng gDNA was sheared with G-Tubes
   (Covaris LLC, USA) following the standard 20 kb protocol. The resulting
   DNA fragments were utilized for duplicate library preparation per
   sample using the Ligation Sequencing Kit V14 (Oxford Nanopore, UK),
   according to the manufacturer’s instructions. Libraries were sequenced
   on the PromethION P48 device with R10.4.1 flow cells (Oxford Nanopore,
   UK) for 72 h with a second library loaded at 24 h post flow cell
   washing.

mRNA library preparation and Transcriptome sequencing

   Transcriptome sequencing was performed for two patients and two
   controls (Supplementary Data [129]4). Total RNA was extracted and
   purified from human whole blood samples collected in Tempus blood RNA
   tubes using a Tempus spin RNA isolation kit (Applied Biosystems, US),
   according to the manufacturer’s instructions. 270–290 ng of total RNA
   was utilized for triplicate library preparation per sample using
   TruSeq® Stranded mRNA Library Prep kit (Illumina, USA), according to
   the manufacturer’s instructions. Libraries were sequenced on Illumina
   NovaSeq 6000.

Long-read sequencing data analysis

   A new pipeline appropriate for long-read nanopore technology was
   developed in-house using published software (see Supplementary Methods
   for details). Briefly, base calling was done using “high-accuracy base
   calling” (HAC) mode during the run using MinKnow distribution (version
   22.05.7) and Guppy (version 6.1.5). The methylation tag (MM,ML / mm,ml)
   was inferred using samtools (version 1.13) for all bam passed files and
   were aligned to the human reference genome (GRCh37/hg19) using minimap2
   (version 2.22-r1101). Epi2Me^[130]18 workflow wf-human-variation
   (v1.2.0), suitable for long read technology was used for the detection
   of the genomic variants using its module – ‘--cnv’, ‘--sv’, ‘--snp’ and
   ‘--methyl’, with default parameters except for CNVs that was run with a
   bin size of 5. CuteSV(v2.0.3) was applied in conjunction with
   identifying SVs. CNVs and SVs were annotated using ClassifyCNV(1.1.1)
   and AnnotSV(v3.2.3). A funnel-down approach was used to filter SVs and
   CNVs, where SVs with at least 5 supporting reads with allele frequency
   ≥ 0.3 and CNVs with log[2]fold change of 0.5 were used for downstream
   analysis. Variants overlapping coding regions of genes associated with
   disease as identified from OMIM and GeneCC database were retained, and
   those unique within the cohort and each method were correlated with
   patients’ phenotype using in-house scripts. Matching variants were then
   manually inspected to identify putative pathogenic ones.

   Methylation analysis was performed by comparing the methylation profile
   of the patients with those reported in literature for the epigenomic
   signature^[131]2. SMA detection was developed based on the methylation
   profile in the genomic region capturing introns 6 to 8
   (chr5:70,239,954-70,249,165) of SMN1, where few bases (0–10%) were
   observed to undergo methylation modifications indicating absence of
   SMN1.

   Single nucleotide variation (SNV) analysis was performed for variants
   not captured by whole exome sequencing as per ACMG guidelines^[132]14.
   In addition, variants within 50 bp annotated exon-intron boundary (NCBI
   Refseq transcripts for build hg19) were analyzed if the splice score as
   calculated from SpliceAI^[133]19 ≥ 0.7. Briefly, SNVs with genotype
   quality, read depth, and mapping quality greater than 10, 30, and 10,
   respectively with filter tag as “PASS” were selected. Rare variants,
   detected in ≤ 2 patients in the cohort, present in disease-associated
   genes as defined by HGMD, OMIM, and GeneCC, where 90% of the isoforms
   were affected by the variant were selected for manual inspection and
   correlation with patient phenotype.

Transcriptome sequence data analysis

   FastQC and MultiQC were used to assess sequencing read quality.
   High-quality reads (Q ≥ 30) were mapped to GRCh37 (hg19) using STAR
   (v2.7.8a) with the default settings. Gene count was performed using
   featureCounts from the SubReads (v2.0.1) with the ‘-p -O -g gene_id -s
   2’ parameters (Supplementary Data [134]5) and analyzed by DESeq2
   (v1.38.3) correcting for batch effects, normalization and differential
   gene expression analysis. Genes with adj p-value < 0.05 were identified
   as significant and selected for pathway enrichment analysis using the
   Enrichr web application (Supplementary Data [135]6). Additional
   statistical analysis was performed using the Fisher exact test to rank
   the top pathways.

Chromosomal microarray analysis

   Chromosomal microarray analysis was performed as previously
   described^[136]20. Briefly, CMA was done using the Affymetrix CytoScan
   HD^TM assay consisting of 2.67 million probes and analyzed using
   Chromosome Analysis Suite^TM software 4.0 to compare, in silico, the
   hybridization pattern of a patient specimen against a pooled reference
   sample set. Losses larger than 200 kb (with ≥ 25 probes) or gains
   larger than 400 kb (≥ 50 probes) are reported, along with smaller
   variants of pathogenic potential.

Droplet digital PCR analysis

   The copy numbers of SMN1 and SMN2 were determined by droplet digital
   PCR (ddPCR) technology as described previously^[137]20, using
   predesigned proprietary ddPCR assay kits for SMN1 (Catalog No:
   186-3500, Bio-Rad). In addition, experimental controls – 0 copy, 1
   copy, and 2 copy controls for SMN1 were included along with a no
   template control. Data analysis was performed using QuantaSoft version
   1.7.4.0917 (Bio-Rad) to determine the copy number variation (CNV).

PCR Gel electrophoresis

   To confirm the heterozygous deletion of approximately 80 kb at 16q23.3
   in patient OXN-048 (Fig. [138]2c), two primer sets of approximately
   20 bp in length were designed near the breakpoints (Supplementary
   Data [139]7). One set spans the full deletion (chr16:83971418-84052205)
   with an expected amplification of 500 bp only in carriers of the
   deleted allele (OXN-048). The other primer set is expected to amplify
   the 815 bp region from the upstream breakpoint through the deleted
   region (chr16:83971418-83972232). Gel electrophoresis analysis using a
   1500 bp ladder revealed an ~ 800 bp band for the unaltered allele
   (Father and OXN-048) and a ~ 500 bp band for the deleted allele in
   OXN-048 only (Fig. [140]2c).

   For the validation of a homozygous deletion of 3.6 kb in patient
   OXN-027 (Fig. [141]2d), two primer sets were designed (Supplementary
   Data [142]7), one spanning the full deletion (chr7:40164307-40168860)
   but expected to amplify a 952 bp PCR product in carriers of the deleted
   allele (OXN-27 and parents). The second primer set is expected to
   amplify a 1045 bp PCR product (chr7:40168214-40168860) spanning the
   second breakpoint, which cannot be detected in individuals homozygous
   for the deletion (OXN-027). PCR analysis revealed a ~ 950 bp band for
   the deleted allele in the patient and heterozygous parents. However, no
   band was observed for the patient (OXN-027) using the primer set within
   the deleted region, indicating a homozygous deletion (Fig. [143]2d).

   All primers were designed using the UCSC Genome Browser, selecting
   sequences with an optimal length of 20–25 bp, a melting temperature
   (Tm) of ~ 60 °C, and a GC content of less than 60%. The final product
   sizes were confirmed using UCSC In-Silico PCR. PCR products were run on
   Lonza FlashGel Device (Bioscience).

Statistical analysis

   Statistical details of experiments and analyses can be found in the
   figure legends and the main text. Differences between the two groups
   were performed using Wilcoxon rank-sum two-tailed test. Comparisons
   between multiple groups were performed using Wilcoxon pairwise
   two-tailed test with bonferroni correction. Exact P-values are reported
   unless smaller than 0.0001.

Reporting summary

   Further information on research design is available in the [144]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [145]Supplementary Information^ (1.5MB, pdf)
   [146]41467_2025_57695_MOESM2_ESM.pdf^ (96.1KB, pdf)

   Description of Additional Supplementary Files
   [147]Supplementary Data 1-7^ (45.6KB, xlsx)
   [148]Reporting Summary^ (72.2KB, pdf)
   [149]Transparent Peer Review file^ (195KB, pdf)

Source data

   [150]Source Data^ (263.7KB, xlsx)

Acknowledgements