Abstract

   Genomics for rare disease diagnosis has advanced at a rapid pace due to
   our ability to perform in-depth analyses on individual patients with
   ultra-rare diseases. The increasing sizes of ultra-rare disease cohorts
   internationally newly enables cohort-wide analyses for new discoveries,
   but well-calibrated statistical genetics approaches for jointly
   analyzing these patients are still under development. The Undiagnosed
   Diseases Network (UDN) brings multiple clinical, research and
   experimental centers under the same umbrella across the United States
   to facilitate and scale case-based diagnostic analyses. Here, we
   present the first joint analysis of whole genome sequencing data of UDN
   patients across the network. We introduce new, well-calibrated
   statistical methods for prioritizing disease genes with de novo
   recurrence and compound heterozygosity. We also detect pathways
   enriched with candidate and known diagnostic genes. Our computational
   analysis, coupled with a systematic clinical review, recapitulated
   known diagnoses and revealed new disease associations. We further
   release a software package, RaMeDiES, enabling automated cross-analysis
   of deidentified sequenced cohorts for new diagnostic and research
   discoveries. Gene-level findings and variant-level information across
   the cohort are available in a public-facing browser
   ([59]https://dbmi-bgm.github.io/udn-browser/). These results show that
   case-level diagnostic efforts should be supplemented by a joint genomic
   analysis across cohorts.

   Subject terms: Statistical methods, Diseases, Data integration
     __________________________________________________________________

   Using well-calibrated statistical methods the authors jointly analyze
   Undiagnosed Diseases Network genomes, identifying known and novel
   disease genes. Software is publicly available to support future
   cross-cohort rare disease discovery efforts.

Introduction

   For decades preceding the widespread application of DNA sequencing,
   identifying the genetic etiology of rare monogenic phenotypes including
   human diseases relied on segregation in pedigrees^[60]1. DNA sequencing
   enabled the analysis of sporadic cases with no segregation data^[61]2.
   Early studies analyzed small cohorts of phenotypically similar
   cases^[62]3,[63]4, a highly successful approach that is, however,
   limited to diseases with multiple known patients with fairly
   homogeneous presentations. In the absence of such phenotypically
   matched case cohorts, per-case studies of individual, undiagnosed
   patients are gaining popularity^[64]5–[65]8. By design, these studies
   cannot attain statistical power from the shared genotypes of unrelated
   patients and require extensive clinical and biological inquiry to prove
   the causal involvement of the genotype in disease^[66]9–[67]11. The
   most recent phase of human Mendelian genetics employs a data science
   approach to gene discovery propelled by the joint genomic analysis of
   phenotypically broad cohorts. Recent studies by the Deciphering
   Developmental Disorders and 100,000 Genomes consortia have demonstrated
   the power of this approach to identify new diagnoses and disease
   genes^[68]12–[69]14. This opens the prospect of international
   cross-cohort analyses, leveraging parallel efforts in many countries,
   and appreciating that rare diseases know no borders.

Undiagnosed diseases network dataset

   Here, we apply existing and newly developed statistical genetics
   methods to the Undiagnosed Diseases Network (UDN) cohort that includes
   extremely difficult-to-solve, likely genetic cases (Fig. [70]1a–e). The
   unique, diagnostically elusive presentation is the only criterion for
   inclusion, and patients have varied presentations including
   neurological, musculoskeletal, immune, endocrine, cardiac, and other
   disorders. Symptom onset ranges from neonatal through late adulthood.
   In contrast to most existing rare disease cohorts, individuals accepted
   to the UDN have already undergone lengthy but ultimately unfruitful
   diagnostic odysseys prior to enrollment. These patients subsequently
   undergo extensive phenotypic characterization at UDN clinical
   sites^[71]15. Both broad Human Phenotype Ontology (HPO) terms and
   highly detailed clinical notes are collected and made available for all
   UDN researchers. Phenotypic information includes laboratory
   evaluations, dysmorphology examinations, specialist assessments,
   surgical records, and imaging (Fig. [72]1f).

Fig. 1. Undiagnosed diseases network cohort analysis.

   [73]Fig. 1
   [74]Open in a new tab

   a Map of clinical and research sites within the Undiagnosed Diseases
   Network (UDN) for evaluating patients and candidate variant
   functionality. Map created using R’s usmap package. b Genetic ancestry
   across the sequenced patient cohort. c Clinician-recorded primary
   symptom categories of patients. “Multiple” indicates 2+ categories
   could be considered primary and “other” indicates an unlisted category.
   Categories marked with an asterisk (*) are neurological subtypes
   (Supplementary Note [75]1). d Patient-reported age of first symptom
   onset. e Patient sex. f Categories and quantity of phenotype
   information collected for patients and made available to all UDN
   researchers. g Intronic variants detectable from genome sequencing
   (orange star) with a predicted splice-altering impact are considered
   alongside exonic variants in our statistical framework; these variants
   may result in retained introns or excised exons in processed
   transcripts. h We consider genes and gene pathways harboring de novo
   and compound heterozygous variants in sequenced trios (72% of all
   accepted cases). Other inheritance modes (e.g., homozygous, uniparental
   disomy) are not considered in our cohort-based framework. Complete case
   count by family structure (e.g., proband-only, duo) is in Supplementary
   Fig. [76]2. i Depiction of clinical framework to uniformly evaluate how
   well a patient’s phenotypes are concordant with a candidate gene or
   variant. All icons in (f) and (i) are from Microsoft PowerPoint.

   There is a similar emphasis on collecting sequencing data, with whole
   genomes sequenced for probands and their immediate or otherwise
   relevant family members. Although smaller than some other rare disease
   cohorts^[77]14, the UDN—with a design bridging clinical, research and
   functional validation teams and a focus on extreme patient
   presentations—was thought to be optimized for case-based analyses where
   probands are evaluated individually in depth. Patients’ detailed
   phenotypic information, ongoing confirmation of new diagnoses, and the
   potential enrichment for novel genetic disorders make for an ideal data
   space to validate and develop statistical approaches. We harmonized and
   jointly called single nucleotide (SNV) and insertion/deletion (indel)
   variants across 4,236 individuals with whole genome sequencing in the
   UDN dataset and additionally called de novo mutations from aligned
   reads across complete trios (Methods, Supplementary Fig. [78]1)^[79]16.

Clinical evaluation of computational findings

   Here, we generate candidate gene–patient matches via a series of
   statistical genomic analyses implemented in our software suite, Rare
   Mendelian Disease Enrichment Statistics (RaMeDiES, Fig. [80]1g, h). We
   focus on the model of monogenic, autosomal inheritance in de novo and
   compound heterozygous cases to prioritize candidates via a
   genotype-first approach, with no clinical input or phenotypic
   information used. Each candidate is then evaluated with respect to the
   patient’s clinical presentation and the gene and variant’s putative
   role in disease—based on known disease associations, functionality in
   model organisms, tissue expression, molecular function, evolutionary
   constraint, and in silico predicted pathogenicity—to assess phenotypic
   match (Fig. [81]1i). For genes or gene pathways harboring deleterious
   variants across multiple individuals, phenotypic similarity between
   patients is also assessed. To scale clinical evaluation to the cohort
   level, we developed a semi-quantitative protocol guided by the ClinGen
   framework^[82]17 that uses hierarchical decision models to increase
   efficiency and enables consistent and comparable evaluations of a
   gene–patient diagnostic fit by independent experts (Supplementary
   Note [83]2, Supplementary Fig. [84]3). We calibrated the protocol
   during development by testing whether the resulting clinical scores
   assigned by different experts on the clinical team were in agreement.
   We validated the protocol in a blind test using non-causative candidate
   genes as controls. Specifically, non-causative genes were selected with
   identical criteria to true candidate genes, except biallelic variants
   were in cis rather than in trans or had low predicted pathogenicity
   scores. The clinical team applying the protocol consistently scored
   true candidate genes higher than control genes (Wilcoxon one-sided
   rank-sum P value = 0.0171, Methods, Supplementary Data [85]1),
   suggesting that the scores generated by the clinicians’ protocol may be
   used to prioritize candidates.

Results

De novo analysis

   Several highly penetrant, extreme phenotypic presentations underlying
   Mendelian and other congenital, complex human diseases have been linked
   to de novo mutations^[86]12,[87]18,[88]19. We began by evaluating all
   independent, sporadic trios with complete sequencing data for de novo
   mutation etiologies. We detected 78.3 de novo point mutations and 9.5
   de novo indels on average per proband genome concordant with the
   expectation^[89]20. Mutation count showed expected dependency on
   parental ages with Poisson-distributed adjusted counts, attesting to
   the quality of de novo calling (Fig. [90]2a, and Supplementary
   Data [91]2, Supplementary Fig. [92]4).

Fig. 2. De novo recurrence.

   [93]Fig. 2
   [94]Open in a new tab

   a De novo mutation counts per proband adjusted for parental ages. Blue
   vertical lines show the mean values of the distributions, and curves
   represent the Poisson fits. b Schematic of analytical test for the
   recurrence of de novos that considers distal splice-altering and exonic
   SNV and indel variants, their variant functionality scores, a
   genome-wide mutation rate model Roulette, and per-gene GeneBayes
   constraint values. “Like” variants refer to those of the same variant
   class (i.e., coding SNVs [CS], coding indels [CI], intronic SNVs [IS],
   intronic indels [II]) and within the same functionality score and minor
   allele frequency thresholds. c Genes with highest significance values
   for de novo recurrence across the cohort, computed as described in b3
   (Pr’( y[g] )) and b4 (Q[g]), when focusing on missense variants with
   AlphaMissense and PrimateAI-3D scores; patients are represented as
   colored circles. Complete gene list with exact P values can be found in
   Supplementary Data [95]3. Note that multiple testing corrections have
   been applied in the form of both Bonferroni (**) and FDR (*)
   thresholds. d AlphaFold-predicted human LRRC7 protein structure
   (AF-[96]Q96NW7-F1) covering the leucine-rich repeat region with high
   predicted structural confidence (amino acid positions 86-463). The
   fifth and eighth LRR domains where missense de novos were found are
   highlighted in blue. Reference alleles for missense de novo variants
   observed across two UDN patients (red) are shown in circles. A
   depiction of LRRC7’s linear protein sequence (Ensembl ID
   ENSP00000498937) with InterPro predicted domains shown in colored boxes
   is below. e Overlap of phenotype terms annotated to each patient.

   We then sought to identify genes enriched for deleterious de novo
   mutations across our patient cohort. Although the UDN dataset contains
   sequenced sporadic cases and their unaffected parents, classic
   case-control associated burden tests are underpowered in the de novo
   regime because of the extreme rarity of these variants, difficulty in
   their identification without complete sequenced trios, and noise in
   variant sampling from both cases and controls^[97]21. Instead, de novo
   goodness-of-fit enrichment tests, which evaluate whether observed de
   novo events exceed what would be expected by chance based on mutation
   rates, are better suited for studying diseases with ultra-rare
   pathogenic variants^[98]18,[99]19. The power of this de novo enrichment
   calculation increases with better models of underlying mutation rates
   and estimates of variant deleteriousness. Recently, the rate of de novo
   emergence has been estimated at basepair resolution with a high degree
   of accuracy^[100]22. Newly developed deep learning models for
   predicting the pathogenicity of de novo and other variants also now
   exhibit unprecedented accuracy in distinguishing disease-relevant
   variants^[101]23,[102]24. We leverage these recent advances to build an
   accurate, unbiased statistical procedure called RaMeDiES-DN to detect
   genes enriched for deleterious de novos.

   Unlike the earliest generation of de novo recurrence approaches which
   leveraged Poisson approximations for runtime efficiency but could not
   take advantage of improved deleteriousness scores and mutation rate
   models^[103]18, RaMeDiES methods seamlessly incorporate per-variant
   deleteriousness scores and mutation rates without sacrificing runtime.
   Briefly, for a given observed variant in a gene, we define its
   “mutational target” as the sum of per-variant de novo mutation rates
   for all possible variants with as high or higher a deleteriousness
   score. By construction, this per-variant mutational target is expected
   to be a uniformly distributed statistic, enabling us to leverage
   closed-form analytical formulas for computing their cumulative
   probabilities (Supplementary Note [104]3). Our framework naturally
   combines different variant types including SNVs and indels with a
   distinct mutation rate model, and can interchangeably utilize various
   deleteriousness scores (Fig. [105]2b, Methods). Although current
   state-of-the-art de novo recurrence approaches also incorporate
   relevant variant-level information, they rely on a complex, permutation
   procedure^[106]12. RaMeDiES’ analytical approach eliminates the need
   for permutation-based significance calculations and can process large
   datasets in mere seconds while maintaining well-calibrated P values
   (Supplementary Fig. [107]5).

   We first focus on the subset of missense variants, which comprise a
   sizable proportion of known Mendelian disease-causing variants and for
   which new, specialized pathogenicity predictions exist (e.g.,
   PrimateAI-3D and AlphaMissense)^[108]23–[109]25. We find one
   significant gene, KIF21A, corresponding to the correct, complete
   diagnosis in one patient and a strong partial diagnosis in one other
   (Bonferroni-adjusted Cauchy-combined P value < 0.05, Fig. [110]2c).
   Notably, disease genes with a de novo mode of inheritance are expected
   to be under strong selection against heterozygous loss-of-function
   variants. We further refine our method to incorporate this intuition by
   prioritizing genes by their GeneBayes values, which indicate selection
   against heterozygous protein-truncating variants, using a weighted
   false discovery rate (FDR) procedure^[111]26–[112]28. With this
   correction, we obtain three gene findings at an equivalent significance
   threshold (Q-value < 3e-6) and eight gene findings at FDR 5%
   (Supplementary Data [113]3). Our second and third gene hits, BAP1 and
   RHOA, correspond to a known correct diagnosis in one patient and strong
   clinical matches in two other patients. Among the five remaining genes
   at FDR 5%, three genes (CACNA1C, COL4A1 and NOTCH1) correspond to known
   diagnoses in five patients and the top clinical candidate in one
   patient. Two impacted patients with de novo missense variants in the
   leucine-rich repeat region of LRRC7, a gene expressed in the brain, had
   phenotypic overlap of hypotonia and developmental delay; one patient
   additionally experienced nystagmus, staring spells, and balance
   problems and the second had ataxic gait (Fig. [114]2d–e). We discussed
   these findings with the managing clinical teams overseeing these cases,
   and these UDN patients have now been included in a clinical case series
   linking LRRC7 to an emerging neurodevelopmental
   disorder^[115]13,[116]29. Another gene, NRBP1, remains a strong
   candidate in two patients due to their neurological phenotype overlap
   and NRBP1’s expression in the brain. An initial functional study in fly
   through the UDN Model Organisms Screening Core was inconclusive. This
   gene has been submitted to MatchMaker Exchange.

   We next consider all exonic variants, including nonsense variants and
   indels, and further incorporate additional well-established
   deleteriousness predictors, CADD and REVEL^[117]30,[118]31. Different
   mutagenesis processes lead to indel mutations, so SNV mutation rate
   models can be inappropriate for modeling this mutation type for some
   genes^[119]32. We therefore constructed a separate per-gene mutation
   rate approximation for indels (see Methods for details). When we reran
   RaMeDiES-DN on all exonic variants using four deleteriousness
   predictors, we additionally identified KMT2B (Bonferroni-adjusted
   Cauchy-combined p-value < 0.05), corresponding to a correct diagnosis
   in four patients due to de novo indel variants (Supplementary
   Data [120]4, Supplementary Fig. [121]6a). The next seven gene findings
   at FDR 5% had already been identified when assessing recurrence of
   missense variants. At FDR 10%, we identify five new putative diagnoses.
   For instance, two patients had high impact missense de novo variants
   impacting H4C5, a histone gene that was not detected with significance
   in our missense-only enrichment test due to its lack of precomputed
   AlphaMissense scores. Both patients had infantile-onset gross motor
   developmental delays, dysmorphic facial features, and speech
   difficulties (Supplementary Fig. [122]6b,c). These and other phenotypes
   exhibited by each patient were recently found to be linked to missense
   variants in histone H4 genes^[123]33. For one of the patients, the de
   novo variant was contemporaneously interpreted by UDN clinical experts
   to be causal^[124]34. The second patient’s de novo variant has now been
   reclassified as “pathogenic” and resulted in a new diagnosis for this
   participant. Two other patients with sporadic neurodevelopmental delay
   each harbor truncating de novo variants in ZNF865. Both patients have
   phenotypic overlap with a series of 10+ other patients with ZNF865
   mutations, which makes a compelling case for pathogenicity^[125]35.
   Subsequent to the publication of the case series, we anticipate this
   gene–disease relationship will be established as causal and both
   variants to be reclassified as likely pathogenic.

Inclusion of deep intronic splice variants

   Next, we demonstrate how RaMeDiES-DN can be extended to additionally
   consider non-exonic variants uncovered uniquely from whole genome
   sequencing using the same methodological infrastructure. On the one
   hand, it remains challenging to identify non-coding regulatory variants
   involved in rare Mendelian diseases^[126]36, and the overall role of
   such variants in congenital disorders is still a subject of
   debate^[127]37. On the other hand, distal gain-of-splice site mutations
   creating new acceptor or donor splicing sites deep in the intronic
   sequences of genes are now a well-recognized cause of monogenic
   disease^[128]38. Identification of splice-altering variants directly
   from genome sequencing data is recently possible using newly-developed
   in silico predictive scores without relying on RNA sequencing. RNA
   sequencing has limitations for diagnosis because it depends on the
   availability of relevant tissue material that is especially challenging
   to obtain for neurodevelopmental patients, and it may miss
   lowly-expressed isoforms and those targeted by nonsense mediated
   decay^[129]39. Moreover, identifying disease-causal intronic splice
   variants is especially appealing due to their potential targetability
   using antisense oligonucleotide therapies^[130]40.

   Unlike functional predictions for exonic variants, which have been
   extensively validated for consistency and accuracy via decades of
   experimental in vitro and in vivo studies, functional predictions of
   splice-altering intronic variants are relatively new and still require
   experimental confirmation. We used a combined
   computational–experimental approach to prioritize distal splice
   variants using in silico predicted scores and an in vitro massively
   parallel splicing reporter assay (Methods, Supplementary
   Fig. [131]7)^[132]41,[133]42. We found the per-variant in silico
   predictions to be mostly concordant with the in vitro assay readouts.
   Variants assigned higher in silico scores are more frequently supported
   by the experimental, in vitro assay, and those with relatively lower in
   silico scores (SpliceAI <0.5) have a non-negligible validation rate as
   well (Supplementary Fig. [134]8). This prompted us to incorporate the
   full range of continuous SpliceAI scores, disregarding only the lowest
   scoring variants, in our statistics. We found this approach to consider
   distal splice-site variants attractive because it lends itself to a
   statistical analysis alongside exonic variants. Once genome-wide
   functionality score tracks are released for the next generation of
   splice predictors as well (e.g., Pangolin)^[135]43, they can be
   integrated into RaMeDiES using the same methodology leveraged for
   exonic variant predictors.

   No new candidate genes with a significant recurrence of intronic de
   novos were found in the UDN dataset. However, by seamlessly
   incorporating non-exonic variants within the same statistical test, our
   approach enables a more complete, automated analysis of the growing
   volume of whole genome sequencing data across rare disease consortia.

   We also ran the state-of-the-art de novo enrichment approach,
   DeNovoWEST^[136]12. Unlike our approach, DeNovoWEST incorporates a
   gain-of-function model alongside a loss-of-function model, which has
   the potential to yield additional findings. We equipped the DeNovoWEST
   algorithm with the Roulette mutation rate model, up-to-date CADD
   variant deleteriousness and s[het] gene constraint scores^[137]27, and
   further incorporated deep intronic variants with predicted
   splice-altering impact (Supplementary Fig. [138]9). This approach
   yielded two Bonferroni-significant genes when applied to the UDN
   dataset, one of which was also uncovered by RaMeDiES-DN at Bonferroni
   significance and the second at a FDR of 6% (KMT2B and H4C5,
   Supplementary Fig. [139]10). We did not apply an FDR-based approach to
   DeNovoWEST’s results to consider additional gene findings, because
   DeNovoWEST P values are a construct over three sometimes dependent
   tests, rendering an FDR adjustment inappropriate. We also find CSMD1, a
   highly indel-prone gene, within DeNovoWEST’s top-ranked five genes,
   likely because indels and SNVs are not distinguished in the mutation
   rate model^[140]44. We then ran DeNovoWEST and RaMeDiES-DN on a larger
   cohort of previously published and jointly-analyzed de novo variants
   from the Deciphering Developmental Diseases (DDD) study, GeneDx, and
   Radboud University Medical Center (RUMC). Both methods recovered known
   autosomal dominant disease genes at a comparable rate (Supplementary
   Fig. [141]16, and Supplementary Data [142]5), but RaMeDiES-DN required
   less than one-tenth the runtime.

   As opposed to methods like DeNovoWEST that require individual-level
   data, RaMeDiES’ operation at the level of mutational targets enables
   sharing of intermediate statistics across cohorts without revealing
   patients’ individual variants. The rare disease field has accumulated
   many individually small, but collectively large, cohorts that have
   never been included in any joint analyses because data governance
   restricts external sharing of variant- and patient-level data. As a
   proof of concept, we ran RaMeDiES-DN on the individual DDD, GeneDx, and
   RUMC de novo cohorts and subsequently combined the three summary-level
   mutational target statistics to generate gene findings. This
   meta-analysis run resulted in the same per-gene significance values as
   compared to RaMeDiES-DN’s run on individual variants from the combined
   cohort (Supplementary Data [143]6).

Compound heterozygous variant analysis

   We next evaluate compound heterozygous (comphet) variants, which are
   the most likely cause of rare recessive disorders in populations with
   low degrees of consanguinity, as is largely the case in the United
   States^[144]45. Comphet variants are defined as a pair of alleles
   landing within the same gene and inherited in trans from unaffected
   parents who are also heterozygous at these loci. These inherited
   disease-causing variants tend to be rare in the population, due to the
   effect of selection against biallelic variant occurrences or against
   slightly deleterious phenotypes of heterozygous variants^[145]46.
   Despite the expected low frequency of individual alleles comprising a
   comphet pair, directly selecting for highly deleterious comphet
   variants still results in numerous false positive findings at the
   cohort level, motivating a statistical approach for cohort-level
   comphet prioritization. Developing a statistical framework analogous to
   de novo goodness-of-fit enrichment requires modeling the distribution
   of rare inherited alleles per individual. De novo mutations arise
   through the universal process of mutagenesis and are therefore
   straightforward to model. Similarly, the distribution of the total
   number of all derived alleles per haploid genome (i.e., all
   non-ancestral variants inherited from one parent without any imposed
   frequency constraints) are also not dependent on the demographic
   history of the population and therefore are straightforward to
   model^[146]47,[147]48. In contrast, however, the distribution of the
   total number of rare alleles per individual is highly dependent on
   population structure, which is notoriously difficult to account for.
   Some previous approaches for determining cohort-level significance of
   comphet variants ignore population structure when modeling the number
   of rare variants. Although this may be an accurate statistical test in
   controlled model organism cross experiments, it is inappropriate for
   natural human populations, where population structure is present even
   at a very fine scale^[148]49. In the Genome of the Netherlands (GoNL)
   dataset for instance, the number of synonymous singletons across
   unrelated individuals still reflects geographic structure along a
   south-north cline^[149]50.

   In our framework, we sidestep directly modeling the distribution of
   rare variant counts per individual and instead condition on the
   observed number of rare variants inherited from each parent using
   trio-level data. Given the number of rare variants inherited from each
   parent per individual, we then compute the probabilities of comphet
   variants landing in high-scoring positions in the same gene across the
   cohort. Although the positions where inherited variants land is
   influenced in part by direct and background selection and biased gene
   conversion, for very rare variants, the effect of these factors is
   negligible compared to the effect of the variation in mutation rate
   along the genome and the overall gene target size^[150]22,[151]51. We
   therefore model the positional distribution of rare inherited variants
   using the same Roulette basepair-resolution de novo mutation rate model
   leveraged in our de novo recurrence model. Our comphet recurrence
   model, called RaMeDiES-CH, relies on the comphet mutational target,
   computed for each comphet variant pair and defined similarly as the de
   novo mutational target previously introduced. Specifically, the comphet
   mutational target is computed as the total squared mutation rate of all
   possible variants with higher functionality scores (Fig. [152]3a).
   RaMeDiES-CH applies the Cauchy P value combination approach as before
   to leverage multiple variant-level functionality scores while
   considering exonic and intronic variants, but does not incorporate gene
   constraint scores, which do not exist for recessive selection (Methods,
   Supplementary Fig. [153]11)^[154]52. RaMeDiES-CH computes
   well-calibrated per-gene P values for comphet variants in a cohort
   (Supplementary Fig. [155]5).

Fig. 3. Compound heterozygous variants.

   [156]Fig. 3
   [157]Open in a new tab

   a Illustration of the unnormalized squared mutational target computed
   for each observed comphet variant in a gene across the cohort
   (RaMeDiES-CH, Supplementary Fig. [158]11) or in an individual across
   the genome (RaMeDiES-IND, Supplementary Fig. [159]12). “Like” variants
   refer to those of the same variant class (i.e., coding SNVs [CS],
   coding indels [CI], intronic SNVs [IS], intronic indels [II]) and
   within the same functionality score and minor allele frequency
   thresholds. b Top ranked genes resulting in the best enrichment
   statistic computed for RaMeDiES-IND. Putative candidates refer to genes
   that remain candidates for pathogenicity due to their
   phenotypically-relevant tissue expression, but where there is not
   enough functional evidence or published gene–disease relationships to
   establish causality at this time. c Overlap between phenotypes
   associated with MED11 and those exhibited by the affected patient. d
   RNA-Seq reads from whole blood samples aligned to first two exons and
   first intron of MED11 for proband (black), dad (blue), mom (purple) and
   two tissue-matched control samples (gray). Thin green line represents
   the intron, solid boxes represent protein-coding exonic regions, and
   the dotted box represents the 5’ untranslated region of MED11. (e)
   Proband exhibits significant retention of the first intron relative to
   parents and fifty-three tissue-matched control samples. Intron
   retention ratio is calculated as the (median read depth of first
   intron) / (number of reads spanning first and second exons + median
   read depth of first intron).

   Across the set of non-consanguineous UDN families, we did not find
   significant recurrent comphet occurrences across genes. This result is
   unsurprising, as previous estimates suggest that in panmictic disease
   populations, only one deleterious comphet variant is expected for every
   five dominant de novos^[160]49. Nevertheless, RaMeDiES-CH represents an
   accurate and unbiased statistical test for the recurrence of comphet
   variants in human populations, which can be applied to reveal new
   diagnoses as sequenced rare disease datasets expand.

   We suspected that singleton disease-causing comphet variants were still
   present in the cohort. We adapted our statistical framework to compute
   an individual-based statistic, RaMeDiES-IND, that normalizes each
   observed comphet variant mutational target across all genes in the
   genome rather than across all individuals in a cohort (Supplementary
   Fig. [161]12). This approach yielded a ranked list of patient–gene
   pairs across the UDN cohort, where each patient–gene pair could be
   annotated as corresponding to a correct diagnosis or otherwise
   (Supplementary Data [162]7). We computed a single enrichment statistic
   for this overall patient–gene ranking, which simultaneously suggested a
   threshold for clinical consideration of findings, as the best Fisher’s
   exact test P achieved across all positions in the list. This enrichment
   statistic was significant when compared to the distribution of the same
   statistic computed across 10,000 random shuffles of the patient–gene
   list (permutation P value = 0.001, Methods, Supplementary
   Fig. [163]13). Among the top thirteen hits yielding this best
   enrichment statistic, we recapitulated five known diagnoses (i.e.,
   NEUROG3, PAH, COX20, NDUFAF8, PRDX3)^[164]53,[165]54 and newly
   identified the genomic cause of a known biochemical diagnosis (i.e.,
   ACADM in a patient with MCAD deficiency). We also identified comphet
   variants in MED11 which are now leading diagnostic candidates in an
   undiagnosed patient experiencing neurodegeneration, developmental
   delay, brain abnormalities, chorea, and hypotonia (Fig. [166]3c). MED11
   is associated with epilepsy and intellectual disability, and this
   patient’s presentation could represent a phenotypic expansion of this
   known disorder^[167]55. Both inherited variants occur deep in the first
   intron of MED11, a region that would be missed by exome-only sequencing
   or analysis, and are predicted to cause cryptic splice donor gains.
   Transcriptome (RNA) sequencing of blood samples from the affected
   patient and both parents highlighted a significantly higher rate of
   first intron retention in the affected patient relative to both parents
   and to fifty unrelated blood control samples (Fig. [168]3d–e, and
   Supplementary Fig. [169]14)^[170]56.

   Our comphet models assume that the two inherited alleles comprising a
   biallelic variant pair arose as a result of independent mutations.
   Homozygous recessive variants, where the exact same allele is inherited
   from each parent, are considered alongside compound heterozygous
   variant configurations in cases where the allele independence
   assumption is unlikely to be violated. In individuals with
   consanguineous parentage, however, the same heterozygous variant in
   both related parents most likely arose from a single, ancestral
   mutational event. As such, we directly exclude probands with evidence
   of parental relatedness from our analysis (Supplementary
   Note [171]4)^[172]49.

Pathway analysis

   Genes involved in the same pathway are frequently involved in similar
   phenotypic presentations^[173]57–[174]60. This provides an enticing
   possibility of drawing statistical power from multiple independent
   occurrences of deleterious variants in the same functional units,
   rather than just in the same genes. Moreover, therapeutics for
   disorders of the same functional unit that are individually too rare to
   meet minimal participant requirements for clinical trials may be
   evaluated together within the same umbrella or basket trial for more
   efficient approval^[175]61. However, such an approach should be pursued
   with caution, as the phenotypes stemming from perturbations of
   different genes in the same functional unit may vary to a great extent.
   Such differences in patient presentations may render the clinical
   evaluations and therapeutic potential of statistically significant
   findings virtually impossible. To mitigate this issue, we first
   initially consider groups of patients with similar phenotypes, and then
   within each of these groups, assess the overrepresentation of
   deleterious mutations across established biological pathways
   (Fig. [176]4a).

Fig. 4. Biological pathways enriched within phenotypically-similar patient
subgroups.

   [177]Fig. 4
   [178]Open in a new tab

   a Schematic illustrating the two-step process of first clustering
   patients according to the semantic similarity of their phenotype terms
   and second finding enriched biological pathways among the genes within
   each patient cluster. b The most significant pathways per cluster (P
   value < 0.01, adjusted for multiple testing using g:Profiler’s
   Statistical Correction Scheme^[179]103) with 1+ genes from 1+
   undiagnosed patients; complete list in Supplementary Data [180]8. c Two
   patients with primarily immune-related symptoms each harbored a
   compelling de novo variant in genes involved in immunoproteasome
   assembly (POMP) and structure (PSMB8). Their symptoms strongly overlap,
   and a subset of these symptoms were also known to be associated with
   either gene in OMIM. d Three neurological patients had variants in
   transmembrane genes involved in the same pathway. These patients had
   substantial phenotypic overlap with each other, as expected, and with
   the phenotypes associated with each of their genes (depicted as star
   shapes in the upset plot). All icons in (a), (c) and (d) are from
   Microsoft PowerPoint.

   We start by clustering 2662 affected patients—with or without
   sequencing data—into 120 groups (median = 17, min = 2, max = 97
   patients per cluster) based on the semantic similarity of their
   phenotype terms. Within each cluster, we then combine our de novo
   candidates, compound heterozygous candidates and known UDN diagnoses
   and perform gene set enrichment analysis (Methods, Supplementary
   Data [181]8). We focus our attention on undiagnosed cases with de novo
   or compound heterozygous candidates within enriched pathways in each
   cluster (Fig. [182]4b). We also report all enriched pathways including
   those with only diagnosed patients for potential therapeutic grouping
   (Supplementary Data [183]9).

   Two of three total candidate genes in one cluster with 19 immunological
   disorder patients are both involved in the immunoproteasome complex
   (KEGG:03050, n = 46, adjusted P value = 4.42e-3). One patient’s genome
   contained a known diagnostic, de novo frameshift variant in POMP, an
   immunoproteasome chaperone protein^[184]62. An undiagnosed patient with
   evidence of chronic inflammation, recurrent infections, and skin
   lesions had a missense de novo in PSMB8, a component of the
   immunoproteasome β-ring with overlapping phenotypic associations
   (OMIM:256040). Both patients had similar combined immunodeficiency
   beyond what was captured in their standardized phenotype terms,
   including decreased global antibodies, decreased B cells and natural
   killer cells, and retained T cell functionality (Fig. [185]4c).
   Disruptions to immunoproteasome assembly and structure have been shown
   to lead to an accumulation of precursor intermediates, impaired
   proteolytic activity and subsequent uncontrolled inflammation^[186]63.

   In another cluster of 15 similarly presenting neurological patients,
   three candidate transmembrane genes were represented in the same
   functional pathway named for some genes’ known involvement in taste
   transduction (KEGG:04742, n = 85, adjusted P value = 7.45e-3). Two of
   these genes, CACNA1C and GABRA3, harbored high impact de novo and
   hemizygous missense variants respectively, corresponding to known
   patient diagnoses^[187]64,[188]65. The genome of another, undiagnosed,
   now deceased patient from this cluster with no prior candidate variants
   contained a synonymous de novo variant predicted to alter splicing in
   another gene in the same functional pathway, HCN4 (Fig. [189]4d). All
   three patients exhibited seizures at a young age, speech delays, severe
   hypotonia, spasticity and visual impairment. Mouse knockouts of HCN4
   demonstrate neurological phenotypes^[190]66,[191]67. In humans, HCN4 is
   expressed in the visual and nervous systems and has recently been
   associated with infantile epilepsy, suggesting that this patient’s
   undiagnosed disorder plausibly represents a phenotypic expansion of
   this gene^[192]66,[193]67.

Discussion

   In total, we analyze 886 sporadic or suspected recessive cases with
   complete trio or quad genome sequencing alongside an additional 463
   phenotyped, diagnosed individuals using computational methods to
   identify de novo recurrence, compound heterozygosity, and pathway
   enrichment. We establish five new diagnoses and three new putative
   diagnoses in known disease-causing genes or genes previously unlinked
   to these patients’ exact presentations. Our prioritization framework
   for pathway analysis further recapitulates 70 known de novo and 10
   known comphet diagnoses and suggests 82 de novo and eight comphet
   candidates for follow-up (Methods, Supplementary Data [194]8).

   In the field of common disease genetics, statistical inference of
   disease-associated genomic loci is confidently regarded as primary
   evidence for their causality. Rare disease genetics, in contrast, is in
   a transition state. Due to a lack of large disease-matched cohorts,
   case-based analyses of individual probands relying heavily on detailed
   patient phenotyping and clinical intuition have typically been used to
   generate candidate variant hypotheses. Evidence required to shift these
   variants from uncertain significance to known pathogenic status comes
   from experimental, functional studies and by identifying additional,
   unrelated, genotype-matched individuals with similar phenotypes through
   variant matchmaking services such as MatchMaker
   Exchange^[195]68,[196]69. Recently, analyses of large,
   broadly-phenotyped cohorts of rare disease patients have demonstrated
   the potential for statistical approaches to reveal diagnoses and
   generate new gene discoveries in the rare disease space as
   well^[197]12,[198]14,[199]70.

   Although the genome is a big place, it is also a finite space with
   respect to gene regions impacted by simple variants such as SNVs and
   short (<10 basepairs) indels. This suggests that, in theory,
   recurrence-based statistical methods applied to sufficiently large
   sequenced cohorts of rare disease patients, even those with diverse
   phenotypic presentations like the UDN, will enable the eventual
   discovery of all causes of prenatally viable monogenic disease stemming
   from these variant types. In order to take statistical discoveries as
   primary evidence, as is the case for common diseases, we need accurate,
   well-calibrated statistical methods^[200]71. Even slight model
   misspecification may propagate and exacerbate the rate of false
   discoveries. The rapid growth of genomic datasets on which these models
   may be applied, coupled with an ongoing difficulty in phenotyping
   patients at scale to confirm findings^[201]72, further increases the
   urgency for more rigorous models.

   Here we show that well-calibrated statistical models can be built for
   both de novo and compound heterozygous modes of inheritance. Although
   novel disease–gene discovery from large, phenotypically- and
   genetically-homogenous cohorts has been demonstrated, we show here that
   rigorous analysis of a diverse, moderately-sized disease cohort at the
   gene and the pathway level shows promise.

   Extending our RaMeDiES statistical framework to consider variant types
   with less accurate mutation rate models or pathogenicity predictions
   may lead to a sacrifice of precision. We therefore acknowledge the
   limitations of our statistical approach for comprehensive rare disease
   diagnosis efforts using short-read sequencing data, and suggest that
   our cohort-level models be applied alongside existing case-based
   variant prioritization pipelines for additional
   statistically-meaningful findings.

   First, although our models integrate non-coding variants with predicted
   splice-altering impacts, they do not consider potentially functional
   variants within whole genome data that fall into untranslated gene
   regions, RNA-coding genes or between genes, as genome-wide tracks of
   verifiable deleteriousness scores do not yet exist for these variant
   types. Improvements to and precomputed scores for these variants will
   be beneficial for interpretation efforts in general and can be
   leveraged in future iterations of RaMeDiES. Our statistical analysis
   also does not consider structural, large indel, copy number, or tandem
   repeat variants, as their identification from short-read sequencing
   data is computationally expensive and often inaccurate. Ongoing efforts
   to generate long-read sequencing data within the UDN and elsewhere
   should enable improved identification and analysis of pathogenic
   complex variants^[202]73,[203]74. Developing a statistical model for
   these variants will still require accurate mutation rate estimates for
   these variant types, which is lacking. GnomAD-SV represents a promising
   iteration toward this goal, but is still highly dependent on their
   specific variant calling pipeline and data rather than biological
   mutagenic processes^[204]75.

   The presented method considers only autosomal de novo and compound
   heterozygous inheritance patterns due to complications in modeling
   other disease-relevant inheritance patterns. First, it is difficult to
   propose a statistical model for biallelic variant counts in
   consanguineous and founder populations, because these counts strongly
   depend on the ancestral population history and inbreeding patterns. A
   more appropriate statistical approach for assessing recurrence of these
   variants would be the extension of parametric linkage applied to very
   large cohorts^[205]76. Second, inclusion of hemizygous or other
   X-chromosome variants requires accurate sex-chromosome variant calling,
   which is notoriously error prone, as well as an accurate mutational
   model of the X chromosome, which is complicated due to sex-dependent
   selection and random X-inactivation^[206]77. Finally, although we do
   not model parental mosaicism or uniparental disomy in our recurrence
   statistics, these inheritance patterns and events are regularly
   assessed via complementary, traditional case-based approaches^[207]10.

   Even though genomic sequencing has been liberalized, currently many
   analyses are still restricted to individual programs, and regulatory
   and technical barriers prevent sharing individual-level variant data
   broadly. In contrast, there are avenues for sharing some variant-level
   data in a way that is easily accessible to clinical geneticists.
   MatchMaker Exchange, for instance, enables the sharing of specific
   variants prioritized through case-based analyses with the goal of
   finding new genotype- and phenotype-matched patients. Broadening the
   success of MatchMaker Exchange to include variants that may not have
   risen to the level of strong candidates in analyses of individual
   patients is desirable. We developed a browser containing our gene-level
   findings and variant-level information about rare genetic variation in
   UDN patients ([208]https://dbmi-bgm.github.io/udn-browser/). In
   addition, we provide an open-source software package, RaMeDiES,
   implementing the efficient and well-calibrated statistics for de novo
   recurrence and deleterious compound heterozygous inference proposed
   here. RaMeDiES’ operation on shareable summary statistics rather than
   on variant-level data enables automated, deidentified cross-analysis of
   substantial existing yet siloed sequenced cohorts for new diagnostic
   discoveries. As the Mendelian genomics field continues the transition
   to this new data science phase, the methods we present here should
   facilitate the exciting prospect of international cross-cohort
   analyses, resulting in new findings and a vastly improved rare disease
   diagnostic rate globally.

Methods

   This work was performed in accordance with all ethical guidelines
   outlined in the NIH IRB #15HG0130. The study proposal and manuscript
   were approved by the UDN Publications and Research Committee. All
   participants have provided written, informed consent for the sharing
   and use of their data in this study.

Undiagnosed diseases network (UDN) structure

   The Undiagnosed Diseases Network (UDN) was established in 2014 with the
   goal of uncovering clinical diagnoses and novel disease-causing genetic
   variants and their molecular functionalities. In its current phase, the
   UDN is composed of 12 clinical research centers across the United
   States and a CLIA-certified sequencing core at Baylor Genetics. Typical
   UDN patients have already endured a multiyear “diagnostic odyssey” of
   extensive prior testing by multiple medical specialists and often
   inconclusive targeted, whole exome and even whole genome sequencing at
   the time of their application to the UDN.

   As part of the application process, a team of clinicians and genetic
   counselors at one of the UDN clinical sites reviews the patient’s
   medical records, referral letters and lab data and creates an
   abstracted case review document. If the team concludes that a UDN
   evaluation may aid in the identification of a diagnosis, the patient is
   accepted to the program and undergoes a thorough in-person evaluation
   at their assigned clinical site. Most patients and available affected
   and unaffected family members receive whole genome sequencing (GS) as
   well. All genomic sequencing data, clinical sequencing reports prepared
   in accordance with the American College of Medical Genetics and
   Genomics (ACMG) variant classification guidelines, structured
   phenotyping in the form of Human Phenotype Ontology (HPO) terms, lab
   results, imaging data, medication data, referral letters and clinical
   notes, the abstracted case review document, and candidate variants and
   genes are uploaded to the UDN Data Management and Coordinating Center.
   All patients enrolled in the UDN have consented to the broad sharing of
   all their genomic, phenotypic and clinical data with researchers
   network-wide for use in research projects and when evaluating
   gene–phenotype fit for a specific patient and candidate gene. Moreover,
   UDN patients have consented to follow-up if additional tests or
   information are deemed useful.

Harmonization of whole genome sequencing data

   Short-read whole genome sequencing was performed between 2014 and 2022
   in accordance with the UDN Manual of Operations, which specifies that
   the average coverage across the genome must be >40X, and >97.5% of all
   coding and noncoding genes (UTRs, coding regions, and intronic regions)
   must be covered at >20X. Paired-end FASTQs were retrieved in June 2022
   for 4268 samples collected from 4236 unique individuals. Six
   individuals subsequently dropped out of the UDN program and are
   excluded from the analyses presented here. All FASTQ pairs were within
   expected parameters
   ([209]https://www.bioinformatics.babraham.ac.uk/projects/fastqc/,
   version 08-01-19) and were aligned to human reference hg38 (with decoys
   and all alt contigs) using the Sentieon^[210]78 bwa-mem implementation
   via the Clinical Genome Analysis Pipeline (CGAP,
   [211]https://cgap.hms.harvard.edu/, version 29cefcec). Read groups were
   added via a custom CGAP script, multiple FASTQ pairs corresponding to
   the same sample were merged, and resulting BAMs were sorted, deduped,
   and recalibrated using a Sentieon implementation. GVCFs were produced
   using CGAP’s implementation of GATK’s HaplotypeCaller. All processing
   steps from FASTQ to GVCF were deployed on spot instances in Amazon Web
   Services (AWS). GVCFs were then egressed to the Harvard Medical School
   institutional cluster. SNVs/indels were jointly called across genomic
   shards then merged using Sentieon tools (version 202112.02). Per-sample
   sex and cross-sample relatedness were confirmed using Somalier (version
   0.2.15, Supplementary Fig. [212]1)^[213]79. We required that all trios
   under consideration in our analysis had two parents reported as
   “unaffected”, a child reported as “affected”, parent–child relatedness
   0.5 ± 0.075, parent–parent relatedness <0.15, mothers had heterozygous
   variants present and a scaled mean depth of ~2 on chromosome X, and
   fathers had a scaled mean depth of ~1 on chromosome Y. All variants
   were annotated using Ensembl VEP (version 108) and slivar (version
   0.2.7) for TOPMed and per-population gnomAD (versions v2.1.1 and
   v3.1.2) variant frequencies and homozygote counts^[214]80,[215]81.

   For our compound heterozygote analysis, we inferred within-family
   regions of identity by descent (IBD) using KING (version
   2.3.0)^[216]82. We required at least one IBD region between the child
   and each parent to further confirm relatedness (in addition to kinship
   coefficient filtering) and no IBD regions of length > 3 Mb between
   parents to confirm non-consanguinity between parents. In families with
   multiple affected siblings, we select one sibling as the proband and
   disregard the other siblings during initial analyses. Variants in other
   affected siblings were then used to check segregation during validation
   of our findings. This process resulted in 846 non-consanguineous trios
   with an affected child and two unaffected parents for our analyses. We
   chose to stringently exclude individuals with evidence of familial
   consanguinity (i.e., by imposing a parental relatedness and IBD region
   length constraints) rather than excluding patients based on their
   relative recessive burden because an assumption of our statistical
   models is violated in consanguineous cases (Supplementary Note [217]4).

Clinical evaluation framework

Protocol overview

   We developed a clinical analysis protocol to reduce subjectivity in the
   assessment of diagnostic candidates. We used the case evaluation
   process implemented at Brigham Genomic Medicine as a
   foundation^[218]83. We then transformed this process into a systematic
   and structured protocol with inspiration from the gene–disease
   association criteria developed by the Clinical Genome Resource
   (ClinGen) group^[219]17,[220]68. Evidence in support of or against a
   candidate variant–participant match was evaluated by a team of clinical
   geneticists according to three categories for experimental evidence not
   taken into account by our statistical analyses: (i) model organism or
   cell line studies, (ii) tissue expression, and (iii) protein molecular
   function. Clinicians also took into account case-level data and
   published literature with case-control data including (iv) known
   disease associations, (v) gene evolutionary constraint, and (vi)
   variant pathogenicity. Discrepancies in opinion were mediated by joint
   discussion until a consensus decision was reached. A detailed
   description of the protocol and scoring scheme is available in
   Supplementary Note [221]2 and hierarchical decision trees to streamline
   the scoring process are provided in Supplementary Fig. [222]2.

Clinical score calibration

   We ensured that the protocol was specific and detailed to the extent
   that different clinicians with access to the same patient data would
   independently assign equal clinical scores to the same candidates. Over
   the course of two months, at least two clinicians each evaluated 2–3
   compound heterozygous candidates per week and independently recorded
   their notes, final clinical scores, and score rationale in a REDCap
   database^[223]84. At weekly joint discussions, they iteratively updated
   the protocol to improve specificity and reduce discrepancies in
   scoring. The final two joint discussions confirmed that categorical and
   final clinical scores assigned by different clinicians were
   consistently in agreement.

Validation

   The clinical team was provided with ten “candidates” and ten “decoys”
   from real UDN patients in random order to evaluate. The team was
   blinded to gene labels, variant inheritance and SpliceAI score
   information during evaluation. “Candidate” genes had two rare variants
   (gnomAD popmax AF < 0.001) inherited in trans where one variant was
   exonic with CADD > 23 and the second variant was intronic with a max
   SpliceAI > 0.3. “Decoy” genes were selected with identical criteria
   except that variants were actually inherited in cis or the intronic
   variant had a maximum SpliceAI score of 0. After assigning final
   clinical scores to each of the 20 genes, the candidate/decoy labels
   were revealed to the clinical team (Supplementary Data [224]1).

Identification of de novo variants

   For each of the 1463 sequenced trios in our harmonized UDN dataset,
   including trios with unaffected offspring, we select the subset of
   variants with read depth >10 and genotype quality (GQ) >20 across
   proband, mother and father. We further subset to variants with a “high”
   Roulette quality score, gnomAD population maximum allele frequency
   <0.01, TOPMed^[225]85 allele frequency < 0.01, proband alternate allele
   read depth >4 and frequency > 0.2, and alternate read depth <2 in both
   parents.

   We then utilize observed aligned reads across each trio and across
   thirty unrelated individuals to assign posterior probabilities to each
   putative de novo variant on autosomes using the CGAP reimplementation
   of novoCaller ([226]https://cgap.hms.harvard.edu/)^[227]16. We consider
   all de novos with a novoCaller posterior probability >0.7 to be high
   confidence, noting that thresholding the novoCaller posterior
   probability from 0.5 to 0.95 has negligible impact on the number of
   passing variants overall and per-proband (Supplementary Fig. [228]6a).
   We further exclude probands with over 150 high confidence de novo
   calls, as these patients frequently had “suspected parental mosaicism”
   mentioned in their clinical records. Finally, because clonal sperm
   mosaicism may lead to siblings inheriting identical de novo variants,
   we exclude duplicate de novo variants within each family from
   downstream recurrence analyses^[229]86. This process resulted in 1072
   trios with an affected proband and unaffected parents for further
   analysis.

Analytical test for de novo cohort-level recurrence

Basic statistic definition

   We define a cohort as a set of N genomes (i.e., collections of genes)
   each with sets of de novo variants arising independently but based on
   the same background de novo mutation rate. Let
   [MATH: <msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   denote the de novo mutation rate of a specific variant i. The
   mutational target of a gene g is
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi></mrow></msub><mo>=</mo><m
   sub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>g</mi></mrow>
   </msub><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>.
   </mo> :MATH]
   1

   The mutational target of a variant
   [MATH: <mi>v</mi> :MATH]
   in gene g is
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><mi>v</mi></mrow
   ></msub><mo>=</mo><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>g</mi></mrow>
   </msub><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></mrow></msub><mspa
   ce width="0.25em"></mspace><msub><mrow><mi
   mathvariant="double-struck">1</mi></mrow><mrow><mi>s</mi><mi>c</mi><mi>
   o</mi><mi>r</mi><msub><mrow><mi>e</mi></mrow><mrow><mi>i</mi></mrow></m
   sub><mo>≥</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi><msub><mrow><mi>e
   </mi></mrow><mrow><mi>v</mi></mrow></msub></mrow></msub> :MATH]
   2

   where
   [MATH:
   <mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi><msub><mrow><mi>e</mi></mrow><m
   row><mi>i</mi></mrow></msub> :MATH]
   is the deleteriousness score of variant i. Intuitively, the more
   surprising and/or deleterious a variant, the smaller its mutational
   target. By definition, variant mutational targets are uniformly
   distributed from 0 to
   [MATH: <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi></mrow></msub>
   :MATH]
   , so
   [MATH:
   <mfrac><mrow><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><mi
   >v</mi></mrow></msub></mrow><mrow><msub><mrow><mi>μ</mi></mrow><mrow><m
   i>g</mi></mrow></msub></mrow></mfrac><mo>~</mo><msub><mrow><mi>U</mi></
   mrow><mrow><mi mathvariant="normal">[0,1]</mi></mrow></msub> :MATH]
   3

   ppose there are K de novo variants falling within gene g across the
   cohort, where
   [MATH: <mi>K</mi><mo>≥</mo><mn>1</mn> :MATH]
   . We define a statistic y as
   [MATH: <mi>y</mi><mo>=</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mi
   mathvariant="normal">=1</mi></mrow><mrow><mi>K</mi></mrow></msubsup><mf
   enced close=")" open="("><mrow><mi
   mathvariant="normal">1</mi><mo>−</mo><mfrac><mrow><msub><mrow><mi>μ</mi
   ></mrow><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><m
   i>i</mi></mrow></msub></mrow></msub></mrow><mrow><msub><mrow><mi>μ</mi>
   </mrow><mrow><mi>g</mi></mrow></msub></mrow></mfrac></mrow></mfenced>
   :MATH]
   4

   Note that Y is a sum of K uniformly distributed variables on [0,1]
   under the null. The distribution of Y given parameter K can thus be
   modeled by the Irwin-Hall (IH) “sum of uniforms'' distribution, which
   has a closed form for its cumulative density function (CDF) and thus
   also for its survival function (SF), where
   [MATH: <mi>S</mi><mi>F</mi><mi mathvariant="normal">=
   1</mi><mo>−</mo><mi>C</mi><mi>D</mi><mi>F</mi> :MATH]
   ^[230]87. This enables us to replace permutation-based significance
   evaluations and instead analytically compute the probability of
   achieving a Y as high or higher than observed with K variants using the
   IH survival function as
   [MATH:
   <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>y</mi><mo>∣</mo><mi>K</mi
   ></mrow><mo>)</mo></mrow><mo>=</mo><mi>I</mi><msub><mrow><mi>H</mi></mr
   ow><mrow><mi>S</mi><mi>F</mi></mrow></msub><mrow><mo>(</mo><mrow><mi>y<
   /mi><mo>∣</mo><mi>K</mi></mrow><mo>)</mo></mrow> :MATH]
   . We note that there are many other constructions over a set of
   uniformly-distributed random variables (such as P
   values)^[231]88,[232]89. We further note that as the cohort size
   dramatically increases, the Irwin-Hall distribution can be replaced
   with the normal distribution.

   Finally, we also modelPr(K), the probability of K independent de novo
   variants to land in gene g given this cohort of size N, to assign an
   overall significance value to our statistic y as
   [MATH: <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mspace
   width="0.25em"></mspace><mi>y</mi></mrow><mo>)</mo></mrow><mo>=</mo><ms
   ubsup><mrow><mo mathsize="big">∑</mo></mrow><mrow><mi>K</mi><mi
   mathvariant="normal">=1</mi></mrow><mrow><mi>∞</mi></mrow></msubsup><mi
   >P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mspace
   width="0.25em"></mspace><mi>y</mi><mo>∣</mo><mi>K</mi></mrow><mo>)</mo>
   </mrow><mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>K</mi></mrow><mo>)
   </mo></mrow><mo>.</mo> :MATH]
   5

   Because neither y nor Pr(K) are defined for
   [MATH: <mi>K</mi><mi mathvariant="normal">= 0</mi> :MATH]
   , we do not expect Pr( y) to be uniformly distributed. Instead, only
   [MATH: <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mspace
   width="0.25em"></mspace><mi>y</mi><mo>∣</mo><mi>K</mi><mo>≥</mo><mi
   mathvariant="normal">1</mi></mrow><mo>)</mo></mrow><mo>=</mo><mi>P</mi>
   <mi>r</mi><mrow><mo>(</mo><mrow><mspace
   width="0.25em"></mspace><mi>y</mi></mrow><mo>)</mo></mrow><mo>/</mo><mi
   >P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>K</mi><mo>≥</mo><mi
   mathvariant="normal">1</mi></mrow><mo>)</mo></mrow> :MATH]
   is expected to be uniformly distributed (Supplementary Fig. [233]9).

   In a single genome with
   [MATH: <mi>n</mi> :MATH]
   total observed de novo variants, the number of de novo variants to land
   in a particular gene g, given that
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi></mrow></msub><mo>≪</mo><m
   n>1</mn> :MATH]
   , is Poisson distributed, parameterized by the expected number of de
   novos
   [MATH:
   <mi>λ</mi><mo>=</mo><mi>n</mi><msub><mrow><mi>μ</mi></mrow><mrow><mi>g<
   /mi></mrow></msub> :MATH]
   . In a cohort of N genomes, the number of de novo variants to land in
   gene g is therefore a sum of N Poisson-distributed random variables,
   which itself is also Poisson distributed. We thus compute
   [MATH:
   <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>K</mi></mrow><mo>)</mo></
   mrow><mo>=</mo><mi>P</mi><mi>o</mi><mi>i</mi><mi>s</mi><mrow><mo>(</mo>
   <mrow><mi>K</mi><mo>∣</mo><mi>λ</mi></mrow><mo>)</mo></mrow> :MATH]
   , where λ is given by
   [MATH: <mi mathvariant="double-struck">E</mi><mspace
   width="0.25em"></mspace><mrow><mo>[</mo><mrow><mi>K</mi></mrow><mo>]</m
   o></mrow><mo>≡</mo><mi>λ</mi><mo>=</mo><msub><mrow><mi>μ</mi></mrow><mr
   ow><mi>g</mi></mrow></msub><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mi
   mathvariant="normal">=1</mi></mrow><mrow><mi>N</mi></mrow></msubsup><ms
   ub><mrow><mi>n</mi></mrow><mrow><mi>j</mi></mrow></msub><mi
   mathvariant="normal"></mi><mo>.</mo> :MATH]
   6

Different deleteriousness scores for coding and intronic variants

   We use continuous, per-variant deleteriousness scores that are
   precomputed and publicly-available for all possible variants
   genome-wide in our computations. Precomputed scores are required for
   the calculation of comprehensive, basepair-resolution mutational
   targets as described above. For missense variants, we interchangeably
   use AlphaMissense (version hg38 released with their 2023 publication),
   PrimateAI-3D (academic license, accessed May 2024), CADD (version 1.6),
   and REVEL (accessed May 2024).23,24,30,31 CADD is also used for scoring
   all other exonic variants, including nonsense and indel variants. For
   intronic variants, we use SpliceAI (academic license, accessed May
   2021).41 We use different variant functionality scores for exonic and
   intronic variants because we found that these values are poorly
   correlated with each other in intronic space (Supplementary
   Fig. [234]15). Clinical sequencing centers also regularly report these
   scores, suggesting their relevance in rare disorders^[235]68,[236]90.

Different mutation rate models for SNV and indel variants

   We use Roulette de novo mutation rates for SNVs genome-wide. Different
   mutational processes lead to indel mutations, so Roulette values cannot
   necessarily be adapted to model this mutation type.32 We approximated
   per-gene joint distributions of indel mutation rates and
   deleteriousness scores as follows. First, we considered all possible
   exonic indels of length ≤10nt for which precomputed CADD scores were
   available for download and all possible intronic insertions of length
   1nt and deletions of length ≤4nt for which precomputed SpliceAI scores
   were available for download. Although SpliceAI provides predictions
   exhaustively for all possible indels, CADD provides scores for the
   subset of indels observed in gnomAD-v2. We excluded all indels that
   overlapped with any SNVs assigned a Roulette “low quality'' filter,
   which are based on gnomAD quality metrics, abnormal density of
   segregating sites, and suspicious patterns of recurrence. We further
   excluded indels with a gnomAD popmax MAF >0.1% and/or a number of
   alleles in gnomAD (AN) in the bottom decile. For exonic and intronic
   variants separately, we binned all indels by their precomputed CADD or
   SpliceAI score rounded to the nearest hundredth. The total number of
   indels within a deleteriousness score bin and all bins corresponding to
   higher deleteriousness scores was used as an approximation to the
   mutational target associated with that score.

Incorporation of different variant types

   Because there are different deleteriousness scores for coding and
   intronic variants and different mutational targets for SNV and indel
   variants, we expand our basic test statistic to accommodate different
   variant types
   [MATH: <mi>t</mi><mo>∈</mo> :MATH]
   {coding SNV, coding indel, intronic SNV, intronic indel}. We redefine a
   gene and variant mutational target with respect to each variant type as
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><mi>t</mi></mrow
   ></msub><mo>=</mo><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>g</mi><mo>,</
   mo><mi>t</mi></mrow></msub><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi
   ></mrow></msub> :MATH]
   7

   and
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><mi>v</mi><mo>,<
   /mo><mi>t</mi></mrow></msub><mo>=</mo><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>g</mi><mo>,</
   mo><mi>t</mi></mrow></msub><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi
   ></mrow></msub><msub><mrow><mi
   mathvariant="double-struck">1</mi></mrow><mrow><mi>s</mi><mi>c</mi><mi>
   o</mi><mi>r</mi><msub><mrow><mi>e</mi></mrow><mrow><mi>i</mi></mrow></m
   sub><mo>≥</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi><msub><mrow><mi>e
   </mi></mrow><mrow><mi>v</mi></mrow></msub></mrow></msub> :MATH]
   8

   where g,t refers to the subset of all possible variants in gene g of
   type t. We define y' as
   [MATH:
   <msup><mrow><mi>y</mi></mrow><mrow><mo>′</mo></mrow></msup><mo>=</mo><m
   sub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>t</mi></mrow></msub><msubsup><mro
   w><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>K</mi></mrow></msubsup><mfenced close=")" open="("><mrow><mi
   mathvariant="normal">1</mi><mo>−</mo><mfrac><mrow><msub><mrow><mi>μ</mi
   ></mrow><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><m
   i>i</mi></mrow></msub><mo>,</mo><mi>t</mi></mrow></msub></mrow><mrow><m
   sub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><mi>t</mi></mrow><
   /msub></mrow></mfrac></mrow></mfenced><mo>~</mo><mi>I</mi><mi>H</mi><mf
   enced close=")"
   open="("><mrow><msup><mrow><mi>y</mi></mrow><mrow><mo>′</mo></mrow></ms
   up><mstyle mathsize="1.61em"><mfenced
   open="∣"><mrow></mrow></mfenced></mstyle><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>t</mi></mrow></msub><msub><mrow><
   mi>K</mi></mrow><mrow><mi>t</mi></mrow></msub></mrow></mfenced> :MATH]
   9

   where K[t] is the number of observed de novo mutations of variant type
   t landing in gene g. The expected number of de novos to land in gene g
   when considering different variant types is
   [MATH: <mi
   mathvariant="double-struck">E</mi><mrow><mo>[</mo><mrow><mi>K</mi></mro
   w><mo>]</mo></mrow><mo>≡</mo><msup><mrow><mi>λ</mi></mrow><mrow><mo>′</
   mo></mrow></msup><mo>=</mo><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>t</mi></mrow></msub><msub><mrow><
   mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><mi>t</mi></mrow></msub><msub
   sup><mrow><mo mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mi
   mathvariant="normal">=1</mi></mrow><mrow><mi>N</mi></mrow></msubsup><ms
   ub><mrow><mi>n</mi></mrow><mrow><mi>j</mi><mo>,</mo><mi>t</mi></mrow></
   msub> :MATH]
   10

   where
   [MATH:
   <msub><mrow><mi>n</mi></mrow><mrow><mi>j</mi><mo>,</mo><mi>t</mi></mrow
   ></msub> :MATH]
   denotes the total number of observed de novo variants of type t in an
   individual j. For each variant type t, we scale
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><mi>t</mi></mrow
   ></msub> :MATH]
   such that
   [MATH:
   <msub><mrow><mo>∑</mo></mrow><mrow><mi>g</mi></mrow></msub><msub><mrow>
   <mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><mi>t</mi></mrow></msub><mi
   mathvariant="normal">= 1</mi> :MATH]
   . We compute
   [MATH:
   <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>K</mi></mrow><mo>)</mo></
   mrow><mo>=</mo><mi>P</mi><mi>o</mi><mi>i</mi><mi>s</mi><mrow><mo>(</mo>
   <mrow><mi>K</mi><mo>∣</mo><mi>λ</mi><mo>′</mo></mrow><mo>)</mo></mrow>
   :MATH]
   .

Cauchy-combination of P values computed with different deleteriousness
predictors

   We can run our method using different deleteriousness score predictions
   for coding SNVs (i.e., AlphaMissense, PrimateAI-3D, CADD, or REVEL),
   resulting in slightly different lists of genes with corresponding P
   values when incorporating this variant type. We combine these lists
   using the Cauchy combination test, an analytic calculation that is
   applicable under arbitrary dependence structures^[237]89.

Incorporation of GeneBayes values

   We incorporate GeneBayes values, which estimate the selection against
   heterozygous protein-truncating variants per gene, as weights in a
   weighted false discovery rate (FDR) procedure^[238]26,[239]91. We sort
   all genes in ascending order by their GeneBayes values. We then
   separate these sorted genes into 10 equally sized decile bins. For each
   gene g in each bin b[g], we compute a weight
   [MATH: <msub><mrow><mi>w</mi></mrow><mrow><mi>g</mi></mrow></msub>
   :MATH]
   as
   [MATH:
   <msub><mrow><mi>w</mi></mrow><mrow><mi>g</mi></mrow></msub><mo>=</mo><m
   n>10</mn><mo>⋅</mo><mfrac><mrow><mo>∣</mo><mi>D</mi><mi>D</mi><mo>∈</mo
   ><mi>b</mi><mrow><mo>[</mo><mrow><mi>g</mi></mrow><mo>]</mo></mrow><mo>
   ∣</mo></mrow><mrow><mo>∣</mo><mi>D</mi><mi>D</mi><mo>∣</mo></mrow></mfr
   ac> :MATH]
   11

   where DD is the set of exclusively dominant disease-causing genes as
   annotated in OMIM (accessed December 2023). Genes without GeneBayes
   values are assigned a weight
   [MATH: <msub><mrow><mi>w</mi></mrow><mrow><mi>g</mi></mrow></msub><mi
   mathvariant="normal">= 1</mi> :MATH]
   . Note that
   [MATH:
   <mi>E</mi><mrow><mo>[</mo><mrow><msub><mrow><mi>w</mi></mrow><mrow><mi>
   g</mi></mrow></msub></mrow><mo>]</mo></mrow><mo>=</mo><mn>1</mn> :MATH]
   and that GeneBayes values, which are constant for all variants within a
   given gene, are independent from y and y' values, which vary for
   variants within a gene based on variant mutational targets and
   deleteriousness scores. This enables us to perform Benjamini-Hochberg
   false discovery rate correction on weighted Q-values computed for each
   Pr(y') as
   [MATH:
   <mi>Q</mi><mo>=</mo><mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>y</mi
   ><mo>′</mo></mrow><mo>)</mo></mrow><mo>/</mo><msub><mrow><mi>w</mi></mr
   ow><mrow><mi>g</mi></mrow></msub> :MATH]
   .26

Massively parallel splicing reporter assay (MPSA)

Assay design

   We designed oligonucleotides to evaluate the impact of a variant
   predicted to cause a cryptic splice site gain or a canonical splice
   site loss. For each variant with a predicted splice-altering impact, we
   extracted the surrounding genomic sequence from the UDN patient
   harboring the variant (alternate) as well as a paired version with the
   variant of interest replaced with the reference allele (reference). We
   centered the candidate sequence on the variant of interest, noting that
   the impacted splice site junction could be up to 50 nucleotides away
   from the variant. For a subset of variants, we also generated candidate
   sequences that were centered on the predicted site of the altered
   splice junction rather than on the variant itself. We embedded each
   candidate sequence in an oligonucleotide template containing a 4-nt
   barcode and flanking primers as follows:

Splice donor library structure

   GCACGGACAAAGTACTAGCC [155-nt candidate sequence][4-nt SD-associated
   barcode] GGAAGATCGACGCAGgtaagt

Splice acceptor library structure

   TGCTCTTATGCGAACGTGTTAAC [4-nt SA-associated barcode] [151-nt candidate
   sequence] GGAAGATCGACGCAGgtaagtt

   The final oligonucleotide library contained 6000 200-nt
   oligonucleotides, encompassing 1920 alternate/reference pairs, which we
   ordered from Twist Bioscience.

Library cloning and experimental protocol

   The oligonucleotide library was cloned separately using PCR
   amplification and NEBuilder assembly into lentiviral splice acceptor
   (pLenti-MPSA-acceptor) and splice donor (pLenti-MPSA-donor) vectors.
   These vectors consisted of an EF1A promoter and an mCherry open reading
   frame (ORF) followed by splicing reporter modules based off of prior
   published massively parallel splicing reporter
   constructs^[240]92,[241]93 (Supplementary Fig. [242]7) as well as a
   separate Puromycin selection cassette. Plasmids have been deposited to
   Addgene under accession numbers 240805 (“plentiMPSA SAentry PuroR'')
   for splice acceptors and 240806 (“plentiMPSA SDentry PuroR'') for
   splice donors.

   Lentiviral particles for each library were produced and titrated. Each
   library was transduced at a multiplicity of infection (MOI) of 0.3 in
   three biological replicates into 6.25*106 cells/replicate of HepG2
   (liver cell line derived from primary cells extracted from a white,
   male, 15-year-old with liver cancer, catalog #HB-8065) and SK-N-SH
   (neural-like cell line derived from primary cells extracted from a
   female 4-year-old, catalog #HTB-11), both acquired from American Type
   Culture Collection (ATCC). Cells were routinely tested for mycoplasma
   contamination via qPCR, and all tests were consistently negative for
   mycoplasma. Cells were selected with Puromycin to completion, and
   genomic DNA and RNA were harvested one week after transduction.

   PCR-based nextgen sequencing (NGS) library preparation was performed on
   all 12 genomic DNA and RNA samples. Libraries were sequenced with 75-nt
   paired-end reads using an Illumina NextSeq 500 sequencer, ensuring an
   average of >1000 reads per library member from all libraries.

Barcode mapping

   Over ~75% of all RNA reads could be mapped back to a 15-nt barcode
   found in our starting dictionary. This resulted in ~6—15 million mapped
   RNA reads per MPSA replicate, yielding a median of 1170 mapped reads
   per alternate/reference library pair per replicate. Results from
   Tapestation, an automated electrophoresis system for sizing and
   quantifying nucleic acid samples, showed that 49.6% of mapped reads
   from splice donor MPSA experiments utilized some library splice donor
   site and 50.4% utilized the experimentally fixed site. Across splice
   acceptor MPSA experiments, 58.3% of mapped reads utilized some library
   splice acceptor site and 41.7% utilized the fixed site.

MPSA validation rate

   We considered all alternate/reference library pairs with at least 10
   barcode-disambiguated mapped reads each in one or more MPSA
   experiments; 99.4% of pairs met this requirement. Each read was then
   categorized as (1) using the experimentally fixed splice site, (2)
   using a splice site corresponding to a known intron/exon junction as
   annotated in Ensembl, (3) using the SpliceAI-predicted cryptic splice
   gain site, (4) using a cryptic splice site at a different location, (5)
   malformed where the read did not begin with the correct fixed sequence
   due to a next-generation sequencing error, or (6) recombined where the
   read did not align to the expected oligo sequence at all. The percent
   of malformed and recombined reads per alternate/reference pair was 7.5%
   (SD=1.9%) and 6.2% (SD=10.6%) respectively on average. The position of
   SpliceAI-predicted cryptic splice sites often did not correspond to the
   expected splice junction based on manual inspection or to the splice
   sites observed in MPSA experiments (55.4% of splice acceptor and 5.4%
   of splice donor predicted positions matched). We instead considered the
   most common cryptic splice site position observed in each alternate
   library sequence to be the predicted site. MPSA validation rate is
   computed per alternate/reference library pair as the difference in
   percentages of total reads supporting the predicted cryptic splice site
   between oligos containing the alternate variant and the corresponding
   reference oligonucleotides (Supplementary Fig. [243]8a).

   We compared the MPSA validation rates across the three biological
   replicates and two cell types using Pearson's correlation
   (Supplementary Fig. [244]8b).

DeNovoWEST gene-specific enrichment of de novo variants

   We modified the DeNovoWEST weighted permutation test by first
   augmenting the set of variants under consideration beyond exonic
   variants to include all possible intronic variants in protein-coding
   genes with a SpliceAI score >0.4, resulting in ~400k additional
   possible variants under consideration.41 To this end, we modified the
   codebase (version v1.0.0) to consider these intronic putatively
   splice-altering variants to have the same functional consequence as
   canonical splice site variants if they had a VEP annotation of
   “splice_acceptor'' or “splice_donor'' or the same functional
   consequence as missense variants otherwise. We then updated the
   required precomputed values, including per-variant mutation rates,
   minor allele frequencies, deleteriousness scores and per-region
   constraint values as detailed below, for all exonic and intronic
   variants under consideration (Supplementary Fig. [245]9a). The
   underlying triplet-context mutational model was replaced with
   genome-wide, per-SNV Roulette mutation rate estimates^[246]94. Each
   variant's minor allele frequency was set to the maximum gnomAD-v3
   population or TOPMed allele frequency. Per-variant Phred-scaled and
   unscaled CADD values were obtained from
   ([247]https://cadd.gs.washington.edu/) (version 1.6 for GRCh38/hg38).
   Updated per-gene shet values were obtained from
   ([248]http://genetics.bwh.harvard.edu/genescores/selection.html) and
   binned into a “low'' category if mean shet was below 0.15 and a “high''
   category otherwise.27 Notably, some stable Ensembl gene IDs in
   GRCh37/hg19 are not present in GRCh38/hg38 and vice versa; all variants
   from the 894 GRCh38/hg38 genes without shet values are binned into the
   “low'' category. Regional missense constraint values, defined for
   adjacent windows covering the full genomic region of each
   protein-coding gene were obtained from
   ([249]https://gnomad.broadinstitute.org/downloads#exac-regional-missens
   e-constraint). We translated these genomic region coordinates from
   GRCh37/hg19 to GRCh38/hg38 using UCSC's LiftOver tool and then assigned
   a constraint value to exonic and intronic variants corresponding to the
   genomic region they fell into. We recomputed the weights assigned to
   each variant type using the union of all de novo variants in our cohort
   and the de novo variants released with DeNovoWEST (encompassing ~31,000
   exome-only trios), because the distribution of de novo variant classes
   in UDN data was very similar to the distribution of de novo variant
   classes in the dataset used by DeNovoWEST (Supplementary
   Fig. [250]9b-c) and because the authors warn that weights generated
   from smaller datasets alone may be unreliable. Gene severity scores
   were then computed for every gene harboring one or more de novo
   variants across our cohort. We adjust DeNovoWEST assigned p-values
   using Bonferroni correction for twice the total number of genes
   evaluated as suggested by the authors. We find that DeNovoWEST and
   RaMeDiES-DN (using only CADD in exonic regions as a closer comparison
   to DeNovoWEST) recovered known autosomal dominant disease genes at a
   comparable rate across de novo variants provided in the original
   DeNovoWEST paper (Supplementary Fig. [251]16).

Analytical test for compound heterozygous cohort-level recurrence

   A compound heterozygous configuration is an independent occurrence of
   two variants: one maternally (M) and the other paternally (D)
   inherited. Here we use the term “compound heterozygous configuration''
   to refer to either two distinct heterozygous variants inherited from
   either parent, or a single homozygous recessive variant where both
   independently-inherited alleles arose from independent mutational
   events (Supplementary Note [252]4). The mutational target of a compound
   heterozygous configuration should therefore lie in a space of squared
   mutational targets. We define the mutational target of a compound
   heterozygous configuration as
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>
   v</mi></mrow><mrow><mi>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>v<
   /mi></mrow><mrow><mi>D</mi></mrow></msub></mrow></msub><mo>=</mo><mi>ma
   x</mi><msup><mrow><mrow><mo>(</mo><mrow><msub><mrow><mi>μ</mi></mrow><m
   row><mi>g</mi><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>M</mi></
   mrow></msub></mrow></msub><mo>,</mo><msub><mrow><mi>μ</mi></mrow><mrow>
   <mi>g</mi><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>D</mi></mrow
   ></msub></mrow></msub></mrow><mo>)</mo></mrow></mrow><mrow><mn>2</mn></
   mrow></msup> :MATH]
   12

   where
   [MATH: <msub><mrow><mi>v</mi></mrow><mrow><mi>M</mi></mrow></msub>
   :MATH]
   and
   [MATH: <msub><mrow><mi>v</mi></mrow><mrow><mi>D</mi></mrow></msub>
   :MATH]
   are maternally and paternally inherited variants comprising a compound
   heterozygous configuration, and
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>
   v</mi></mrow><mrow><mi>M</mi></mrow></msub></mrow></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>
   v</mi></mrow><mrow><mi>D</mi></mrow></msub></mrow></msub> :MATH]
   are computed as in Eq. [253]12. To prioritize compound heterozygous
   configurations with both deleterious variants, we use the maximum over
   per-variant mutational targets. A single deleterious variant in a
   compound heterozygous configuration may indicate carrier status rather
   than a compelling candidate for a rare disorder. By this definition,
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>
   v</mi></mrow><mrow><mi>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>v<
   /mi></mrow><mrow><mi>D</mi></mrow></msub></mrow></msub> :MATH]
   is uniformly distributed at null (Supplementary Note [254]4). This
   enables us to define a similarly constructed statistic
   [MATH: <msup><mrow><mi>y</mi></mrow><mrow><mi>c</mi></mrow></msup>
   :MATH]
   modelable by the Irwin-Hall distribution as in the case of recurrent de
   novos (Eq. [255]4):
   [MATH:
   <msup><mrow><mi>y</mi></mrow><mrow><mi>c</mi></mrow></msup><mo>=</mo><m
   subsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>K</mi></mrow></msubsup><mspace
   width="0.25em"></mspace><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mfrac><mrow><msub><mrow><mi>μ</mi><
   /mrow><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>
   M</mi><mo>,</mo><mi>i</mi></mrow></msub><mo>,</mo><msub><mrow><mi>v</mi
   ></mrow><mrow><mi>D</mi><mo>,</mo><mi>i</mi></mrow></msub></mrow></msub
   ></mrow><mrow><msup><mrow><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi>
   </mrow></msub></mrow><mrow><mn>2</mn></mrow></msup></mrow></mfrac></mro
   w></mfenced><mo>~</mo><mi>I</mi><mi>H</mi><mfenced close=")"
   open="("><mrow><msup><mrow><mi>y</mi></mrow><mrow><mi>c</mi></mrow></ms
   up><mo>∣</mo><mi>K</mi></mrow></mfenced> :MATH]
   13

   where K is the number of compound heterozygous configurations
   independently landing in gene g across the cohort, and
   [MATH:
   <msub><mrow><mi>v</mi></mrow><mrow><mi>M</mi><mo>,</mo><mi>j</mi></mrow
   ></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>v</mi></mrow><mrow><mi>D</mi><mo>,</mo><mi>j</mi></mrow
   ></msub> :MATH]
   are the maternally and paternally inherited variants in gene g in
   individual j. As before, K is approximately Poisson distributed, and
   parameter
   [MATH: <msup><mrow><mi>λ</mi></mrow><mrow><mi>c</mi></mrow></msup>
   :MATH]
   , the expected number of compound heterozygous configurations to land
   in gene g, is given by
   [MATH: <mi
   mathvariant="double-struck">E</mi><mrow><mo>[</mo><mrow><mi>K</mi></mro
   w><mo>]</mo></mrow><mo>≡</mo><msup><mrow><mi>λ</mi></mrow><mrow><mi>c</
   mi></mrow></msup><mo>=</mo><msup><mrow><msub><mrow><mi>μ</mi></mrow><mr
   ow><mi>g</mi></mrow></msub></mrow><mrow><mn>2</mn></mrow></msup><msubsu
   p><mrow><mo mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mi
   mathvariant="normal">=1</mi></mrow><mrow><mi>N</mi></mrow></msubsup><ms
   ub><mrow><mi>n</mi></mrow><mrow><mi>M</mi><mo>,</mo><mi>j</mi></mrow></
   msub><msub><mrow><mi>n</mi></mrow><mrow><mi>D</mi><mo>,</mo><mi>j</mi><
   /mrow></msub> :MATH]
   14

   where
   [MATH:
   <msub><mrow><mi>n</mi></mrow><mrow><mi>M</mi><mo>,</mo><mi>j</mi></mrow
   ></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>n</mi></mrow><mrow><mi>D</mi><mo>,</mo><mi>j</mi></mrow
   ></msub> :MATH]
   are the numbers of maternally and paternally inherited rare variants in
   an individual j, respectively. We compute
   [MATH:
   <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>K</mi></mrow><mo>)</mo></
   mrow><mo>=</mo><mi>P</mi><mi>o</mi><mi>i</mi><mi>s</mi><mrow><mo>(</mo>
   <mrow><mi>K</mi><mo>∣</mo><msup><mrow><mi>λ</mi></mrow><mrow><mi>c</mi>
   </mrow></msup></mrow><mo>)</mo></mrow> :MATH]
   as before.

   Finally, we extend this basic test statistic to accommodate 16 compound
   heterozygous configuration types as
   [MATH:
   <mrow><mo>(</mo><mrow><msub><mrow><mi>t</mi></mrow><mrow><mi>M</mi></mr
   ow></msub><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>D</mi></mrow
   ></msub></mrow><mo>)</mo></mrow><mo>∈</mo> :MATH]
   {coding SNV, coding indel, intronic SNV, intronic indel}2 and define
   [MATH:
   <msup><mrow><mi>y</mi></mrow><mrow><msup><mrow><mi>c</mi></mrow><mrow><
   mo>′</mo></mrow></msup></mrow></msup> :MATH]
   and Poisson parameter
   [MATH:
   <msup><mrow><mi>λ</mi></mrow><mrow><msup><mrow><mi>c</mi></mrow><mrow><
   mo>′</mo></mrow></msup></mrow></msup> :MATH]
   accordingly as
   [MATH:
   <msup><mrow><mi>y</mi></mrow><mrow><msup><mrow><mi>c</mi></mrow><mrow><
   mo>′</mo></mrow></msup></mrow></msup><mo>=</mo><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><msub><mrow><mi>t</mi></mrow><mrow><m
   i>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>
   D</mi></mrow></msub></mrow></msub><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mi
   mathvariant="normal">=1</mi></mrow><mrow><msub><mrow><mi>K</mi></mrow><
   mrow><msub><mrow><mi>t</mi></mrow><mrow><mi>M</mi></mrow></msub><mo>,</
   mo><msub><mrow><mi>t</mi></mrow><mrow><mi>D</mi></mrow></msub></mrow></
   msub></mrow></msubsup><mfenced close=")" open="("><mrow><mi
   mathvariant="normal">1</mi><mo>−</mo><mfrac><mrow><msub><mrow><mi>μ</mi
   ></mrow><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><m
   i>M</mi><mo>,</mo><mi>i</mi></mrow></msub><mo>,</mo><msub><mrow><mi>t</
   mi></mrow><mrow><mi>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>v</mi
   ></mrow><mrow><mi>D</mi><mo>,</mo><mi>i</mi></mrow></msub><mo>,</mo><ms
   ub><mrow><mi>t</mi></mrow><mrow><mi>D</mi></mrow></msub></mrow></msub><
   /mrow><mrow><mi>max</mi><msup><mrow><mrow><mo>(</mo><mrow><mspace
   width="0.25em"></mspace><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><m
   o>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>M</mi></mrow></msub></mr
   ow></msub><mo>,</mo><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,<
   /mo><msub><mrow><mi>t</mi></mrow><mrow><mi>D</mi></mrow></msub></mrow><
   /msub></mrow><mo>)</mo></mrow></mrow><mrow><mn>2</mn></mrow></msup></mr
   ow></mfrac></mrow></mfenced> :MATH]
   15

   and
   [MATH: <mi
   mathvariant="double-struck">E</mi><mrow><mo>[</mo><mrow><mi>K</mi></mro
   w><mo>]</mo></mrow><mo>≡</mo><msup><mrow><mi>λ</mi></mrow><mrow><msup><
   mrow><mi>c</mi></mrow><mrow><mo>′</mo></mrow></msup></mrow></msup><mo>=
   </mo><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><msub><mrow><mi>t</mi></mrow><mrow><m
   i>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>
   D</mi></mrow></msub></mrow></msub><msub><mrow><mi>μ</mi></mrow><mrow><m
   i>g</mi><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>M</mi></mrow><
   /msub></mrow></msub><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,<
   /mo><msub><mrow><mi>t</mi></mrow><mrow><mi>D</mi></mrow></msub></mrow><
   /msub><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mi
   mathvariant="normal">=1</mi></mrow><mrow><mi>N</mi></mrow></msubsup><ms
   ub><mrow><mi>n</mi></mrow><mrow><mi>M</mi><mo>,</mo><msub><mrow><mi>t</
   mi></mrow><mrow><mi>M</mi></mrow></msub><mo>,</mo><mi>j</mi></mrow></ms
   ub><msub><mrow><mi>n</mi></mrow><mrow><mi>D</mi><mo>,</mo><msub><mrow><
   mi>t</mi></mrow><mrow><mi>D</mi></mrow></msub><mo>,</mo><mi>j</mi></mro
   w></msub> :MATH]
   16

   where
   [MATH:
   <msub><mrow><mi>K</mi></mrow><mrow><msub><mrow><mi>t</mi></mrow><mrow><
   mi>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi
   >D</mi></mrow></msub></mrow></msub> :MATH]
   is the number of compound heterozygous configurations in gene g across
   the cohort where the maternally inherited variant is of type
   [MATH: <msub><mrow><mi>t</mi></mrow><mrow><mi>M</mi></mrow></msub>
   :MATH]
   and the paternally inherited variant is of type
   [MATH: <msub><mrow><mi>t</mi></mrow><mrow><mi>D</mi></mrow></msub>
   :MATH]
   . Instances where
   [MATH:
   <msub><mrow><mi>K</mi></mrow><mrow><msub><mrow><mi>t</mi></mrow><mrow><
   mi>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi
   >D</mi></mrow></msub></mrow></msub><mi mathvariant="normal">= 0</mi>
   :MATH]
   are excluded from the above sums. For each variant type t, we scale
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><mi>t</mi></mrow
   ></msub> :MATH]
   such that
   [MATH:
   <msub><mrow><mo>∑</mo></mrow><mrow><mi>g</mi></mrow></msub><msub><mrow>
   <mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><mi>t</mi></mrow></msub><mi
   mathvariant="normal">= 1</mi> :MATH]
   . We compute the probability of
   [MATH:
   <msup><mrow><mi>y</mi></mrow><mrow><msup><mrow><mi>c</mi></mrow><mrow><
   mo>′</mo></mrow></msup></mrow></msup> :MATH]
   as in Eq. [256]5. Note that inherited homozygous recessive variants
   that arose from a single, ancestral mutational event violate the
   assumptions of our approach and are excluded (Supplementary
   Note [257]4).

Modeling false positive diagnoses

   For any gene where the observed number of variants
   [MATH: <mi>K</mi><mo>></mo><mi
   mathvariant="double-struck">E</mi><mrow><mo>[</mo><mrow><mi>K</mi></mro
   w><mo>]</mo></mrow> :MATH]
   across the cohort, we suspect that there are some true diagnoses in
   specific patients as well as some “false positives'' where a randomly
   occurring variant in a patient is unrelated to the patient's condition.
   We use the binomial distribution parameterized by K independent trials
   and probability of success per trial
   [MATH: <mi
   mathvariant="double-struck">E</mi><mrow><mo>[</mo><mrow><mi>K</mi></mro
   w><mo>]</mo></mrow><mo>/</mo><mi>N</mi> :MATH]
   to estimate the proportion of false positive diagnoses for each gene.

Analytical test for individual-level compound heterozygous configuration

   Given a set of independent compound heterozygous configurations across
   genes in a single individual's genome, we construct a test for the
   hypothesis of a monogenic, recessive disorder caused by one of these
   compound heterozygous configurations against the null. We assume up to
   one compound heterozygous configuration per gene, i.e., for each gene
   g,
   [MATH:
   <msub><mrow><mi>n</mi></mrow><mrow><mi>M</mi></mrow></msub><msub><mrow>
   <mi>n</mi></mrow><mrow><mi>D</mi></mrow></msub><msup><mrow><msub><mrow>
   <mi>μ</mi></mrow><mrow><mi>g</mi></mrow></msub></mrow><mrow><mn>2</mn><
   /mrow></msup><mo>≪</mo><mn>1</mn> :MATH]
   , where
   [MATH:
   <msub><mrow><mo>∑</mo></mrow><mrow><mi>g</mi></mrow></msub><msub><mrow>
   <mi>μ</mi></mrow><mrow><mi>g</mi></mrow></msub><mi
   mathvariant="normal">= 1</mi> :MATH]
   and
   [MATH: <msub><mrow><mi>n</mi></mrow><mrow><mi>M</mi></mrow></msub>
   :MATH]
   and
   [MATH: <msub><mrow><mi>n</mi></mrow><mrow><mi>D</mi></mrow></msub>
   :MATH]
   are the numbers of maternally and paternally inherited rare variants in
   this individual's genome.

   We now rescale the mutational target of a compound heterozygous
   configuration (Eq. [258]12) with respect to all genes g in the genome
   as
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>M<
   /mi></mrow></msub><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>D</m
   i></mrow></msub></mrow></msub><mo>=</mo><mfrac><mrow><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>G</mi></mrow>
   </msub><mi>m</mi><mi>i</mi><mi>n</mi><mrow><mo>(</mo><mrow><mspace
   width="0.25em"></mspace><msup><mrow><msub><mrow><mi>μ</mi></mrow><mrow>
   <mi>i</mi></mrow></msub></mrow><mrow><mn>2</mn></mrow></msup><mo>,</mo>
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>
   v</mi></mrow><mrow><mi>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>v<
   /mi></mrow><mrow><mi>D</mi></mrow></msub></mrow></msub></mrow><mo>)</mo
   ></mrow></mrow><mrow><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>G</mi></mrow>
   </msub><msup><mrow><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></mrow>
   </msub></mrow><mrow><mn>2</mn></mrow></msup></mrow></mfrac><mi
   mathvariant="normal"></mi><mo>.</mo> :MATH]
   17

   Intuitively, this corresponds to the probability of observing a
   compound heterozygous configuration with an equal or smaller (i.e.,
   more surprising) mutational target occurring in any gene in the genome.
   Thus,
   [MATH: <mover
   accent="true"><mrow><mi>μ</mi></mrow><mrow><mo>~</mo></mrow></mover><mo
   >~</mo><msub><mrow><mi>U</mi></mrow><mrow><mi
   mathvariant="normal">[0,1]</mi></mrow></msub> :MATH]
   . We precompute each gene's compound heterozygous mutational target
   [MATH:
   <msup><mrow><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></mrow></msub>
   </mrow><mrow><mn>2</mn></mrow></msup> :MATH]
   for all genes in the genome in order to quickly compute
   [MATH: <mover
   accent="true"><mrow><mi>μ</mi></mrow><mrow><mo>~</mo></mrow></mover>
   :MATH]
   values for each observed compound heterozygous configuration in an
   individual.

   Next, we define our statistic
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup> :MATH]
   per individual as the minimal observed rescaled compound heterozygous
   mutational target:
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup><mo>=</mo><mi>m</mi><mi>i</mi><mi>n</
   mi><mrow><mo>(</mo><mrow><mspace
   width="0.25em"></mspace><msub><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mn>1</mn></mrow></msub><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</
   mo><mo>,</mo><msub><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mo>~</mo></mover></mrow><mrow><mi
   >K</mi></mrow></msub></mrow><mo>)</mo></mrow> :MATH]
   18

   We compute the probability of observing a
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup> :MATH]
   value this low or lower given K total genes with observed compound
   heterozygous configurations in an individual's genome as
   [MATH: <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup><mo>∣</mo><mi>K</mi></mrow><mo>)</mo>
   </mrow><mo>=</mo><mn>1</mn><mo>−</mo><msubsup><mrow><mo
   mathsize="big">∏</mo></mrow><mrow><mi>i</mi><mi
   mathvariant="normal">=1</mi></mrow><mrow><mi>K</mi></mrow></msubsup><mi
   >P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>Y</mi><mo>></mo><msup><mrow>
   <mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup></mrow><mo>)</mo></mrow><mo>=</mo><mn
   >1</mn><mo>−</mo><mi>P</mi><mi>r</mi><msup><mrow><mrow><mo>(</mo><mrow>
   <mi>Y</mi><mo>></mo><msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup></mrow><mo>)</mo></mrow></mrow><mrow>
   <mi>K</mi></mrow></msup><mo>.</mo> :MATH]
   19

   where Y is a dummy variable. Because
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup> :MATH]
   is uniformly distributed on [0,1],
   [MATH:
   <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>Y</mi><mo>></mo><msup><mr
   ow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup></mrow><mo>)</mo></mrow><mo>=</mo><mr
   ow><mo>(</mo><mrow><mn>1</mn><mo>−</mo><msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup></mrow><mo>)</mo></mrow> :MATH]
   , so we simplify this calculation as
   [MATH: <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup><mo>∣</mo><mi>K</mi></mrow><mo>)</mo>
   </mrow><mo>=</mo><mn>1</mn><mo>−</mo><msup><mrow><mrow><mo>(</mo><mrow>
   <mi mathvariant="normal">1</mi><mo>−</mo><msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup></mrow><mo>)</mo></mrow></mrow><mrow>
   <mi>K</mi></mrow></msup> :MATH]
   20

   We also model the distribution of K observed compound heterozygous
   configurations across an individual's genome in order to compute the
   overall probability of our statistic
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup> :MATH]
   using the same formulation as before (Eq. [259]5). The distribution of
   K, given our prior assumption of at most one compound heterozygous
   configuration per gene, has an exact solution as the number of double
   events in a bivariate binomial distribution with correlation parameter
   [MATH: <mi>ρ</mi> :MATH]
   capturing the effect of different gene lengths on K. However, due to
   the complexity in calculations of the exact solution, here we use the
   Poisson approximation instead because, for each gene g,
   [MATH:
   <msub><mrow><mo>∑</mo></mrow><mrow><mi>g</mi></mrow></msub><msub><mrow>
   <mi>μ</mi></mrow><mrow><mi>g</mi></mrow></msub><mi
   mathvariant="normal">= 1</mi> :MATH]
   and
   [MATH:
   <msub><mrow><mi>n</mi></mrow><mrow><mi>M</mi></mrow></msub><msub><mrow>
   <mi>n</mi></mrow><mrow><mi>D</mi></mrow></msub><msup><mrow><msub><mrow>
   <mi>μ</mi></mrow><mrow><mi>g</mi></mrow></msub></mrow><mrow><mn>2</mn><
   /mrow></msup><mo>≪</mo><mn>1</mn> :MATH]
   . The
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>λ</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup> :MATH]
   parameter for the Poisson approximation in this case is
   [MATH: <mi
   mathvariant="double-struck">E</mi><mrow><mo>[</mo><mrow><mi>K</mi></mro
   w><mo>]</mo></mrow><mo>≡</mo><msup><mrow><mover
   accent="true"><mrow><mi>λ</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>c</mi></mrow></msup><mo>=</mo><msub><mrow><mi>n</mi></mro
   w><mrow><mi>M</mi></mrow></msub><msub><mrow><mi>n</mi></mrow><mrow><mi>
   D</mi></mrow></msub><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>G</mi></mrow></msubsup><msup><mrow><msub><mrow><mi>μ</mi></mr
   ow><mrow><mi>i</mi></mrow></msub></mrow><mrow><mn>2</mn></mrow></msup>
   :MATH]
   21

   Finally, we accommodate the 16 compound heterozygous configuration
   types as
   [MATH:
   <mrow><mo>(</mo><mrow><msub><mrow><mi>t</mi></mrow><mrow><mi>M</mi></mr
   ow></msub><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>D</mi></mrow
   ></msub></mrow><mo>)</mo></mrow><mo>∈</mo> :MATH]
   {coding SNV, coding indel, intronic SNV, intronic indel}2 and redefine
   [MATH: <mover
   accent="true"><mrow><mi>μ</mi></mrow><mrow><mo>~</mo></mrow></mover><mo
   >′</mo> :MATH]
   ,
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><msup><mrow><mi>c</mi></mrow><mrow><mo>′</mo></mrow></msup></
   mrow></msup> :MATH]
   and Poisson parameter
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>λ</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><msup><mrow><mi>c</mi></mrow><mrow><mo>′</mo></mrow></msup></
   mrow></msup> :MATH]
   accordingly as
   [MATH: <msubsup><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>g</mi><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>M<
   /mi></mrow></msub><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>D</m
   i></mrow></msub></mrow><mrow><mo>′</mo></mrow></msubsup><mo>=</mo><mfra
   c><mrow><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><msub><mrow><mi>t</mi></mrow><mrow><m
   i>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>
   D</mi></mrow></msub></mrow></msub><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>G</mi></mrow>
   </msub><mi>min</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mrow><msub><mrow><mi>μ</mi></m
   row><mrow><mi>i</mi><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>M<
   /mi></mrow></msub></mrow></msub><msub><mrow><mi>μ</mi></mrow><mrow><mi>
   i</mi><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>D</mi></mrow></m
   sub></mrow></msub><mo>,</mo><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</m
   i><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>M</mi><mo>,</mo><msu
   b><mrow><mi>t</mi></mrow><mrow><mi>M</mi></mrow></msub></mrow></msub><m
   o>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>D</mi><mo>,</mo><msub><m
   row><mi>t</mi></mrow><mrow><mi>D</mi></mrow></msub></mrow></msub></mrow
   ></msub></mrow><mo>)</mo></mrow></mrow><mrow><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><msub><mrow><mi>t</mi></mrow><mrow><m
   i>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>
   D</mi></mrow></msub></mrow></msub><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>G</mi></mrow>
   </msub><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi><mo>,</mo><msub><mr
   ow><mi>t</mi></mrow><mrow><mi>M</mi></mrow></msub></mrow></msub><msub><
   mrow><mi>μ</mi></mrow><mrow><mi>i</mi><mo>,</mo><msub><mrow><mi>t</mi><
   /mrow><mrow><mi>D</mi></mrow></msub></mrow></msub></mrow></mfrac>
   :MATH]
   22

   and
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mo>~</mo></mover></mrow><mrow><ms
   up><mrow><mi>c</mi></mrow><mrow><mo>′</mo></mrow></msup></mrow></msup><
   mo>=</mo><mi>m</mi><mi>i</mi><mi>n</mi><mrow><mo>(</mo><mrow><mspace
   width="0.25em"></mspace><msubsup><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mo>~</mo></mover></mrow><mrow><mn
   >1</mn></mrow><mrow><mo>′</mo></mrow></msubsup><mo>,</mo><mo>.</mo><mo>
   .</mo><mo>.</mo><mo>,</mo><msubsup><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mo>~</mo></mover></mrow><mrow><mi
   >K</mi></mrow><mrow><mo>′</mo></mrow></msubsup></mrow><mo>)</mo></mrow>
   :MATH]
   23

   and
   [MATH: <mi mathvariant="double-struck">E</mi><mspace
   width="0.16em"></mspace><mrow><mo>[</mo><mrow><mi>K</mi></mrow><mo>]</m
   o></mrow><mo>≡</mo><msup><mrow><mover
   accent="true"><mrow><mi>λ</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><msup><mrow><mi>c</mi></mrow><mrow><mo>′</mo></mrow></msup></
   mrow></msup><mo>=</mo><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><msub><mrow><mi>t</mi></mrow><mrow><m
   i>M</mi></mrow></msub><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>
   D</mi></mrow></msub></mrow></msub><msub><mrow><mi>n</mi></mrow><mrow><m
   i>M</mi><mo>,</mo><msub><mrow><mi>t</mi></mrow><mrow><mi>M</mi></mrow><
   /msub></mrow></msub><msub><mrow><mi>n</mi></mrow><mrow><mi>D</mi><mo>,<
   /mo><msub><mrow><mi>t</mi></mrow><mrow><mi>D</mi></mrow></msub></mrow><
   /msub><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>G</mi></mrow>
   </msub><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi><mo>,</mo><msub><mr
   ow><mi>t</mi></mrow><mrow><mi>M</mi></mrow></msub></mrow></msub><msub><
   mrow><mi>μ</mi></mrow><mrow><mi>i</mi><mo>,</mo><msub><mrow><mi>t</mi><
   /mrow><mrow><mi>D</mi></mrow></msub></mrow></msub> :MATH]
   24

Enrichment for correct diagnoses

   Given a ranked list of genes across a cohort of patients, where each
   gene may be diagnostic for the given patient, we can compute enrichment
   for correct diagnoses at each gene rank. We use Fisher's exact test to
   compare the proportion of complete, certain diagnoses in all genes up
   to and including rank k compared to the proportion of correct diagnoses
   at genes ranked k+1 through the end of the list. We consider the
   minimum Fisher's exact test P across all k to be our overall
   enrichment. We assign a permutation-based P value to this enrichment
   value by randomly permuting the initial gene list 10,000 times and
   recomputing the minimum Fisher's exact test P for each permuted list.

Transcriptome sequencing analysis for MED11

RNA extraction, sequencing and quality control

   RNA was extracted from UDN patients' whole blood samples received at
   UCLA between 2018 and 2019 using PAXgene Blood RNA extraction kits from
   Qiagen. Concentration of RNA in each sample was quantified using the
   Qubit 3.0 Fluorometer. RNA integrity numbers (RINs), a quality control
   measure, were assessed per sample using the Agilent bioanalyzer. RNA
   libraries were prepared for each sample using either the NuGEN
   Universal Plus mRNA-Seq kit or the Illumina TruSeq mRNA + Globin Minus
   kit. Sequencing was then performed on the Illumina NovaSeq 6000 to
   generate ~50-100 million 100-150bp paired-end reads per sample. Library
   preparation and sequencing were performed at the UCLA Neuroscience
   Genomics Core and the UCLA Technology Center for Genomics and
   Bioinformatics Core. Sequenced reads in FASTQ format were aligned to
   human reference genome GRCh37 using STAR v2.5.2b with default
   parameters and Gencode v19 annotations^[260]95,[261]96. To increase
   sensitivity to novel splice junctions, reads were mapped using the STAR
   2-pass mode, where novel splice junctions detected during the first
   pass alignment are indexed and used alongside known splice junctions in
   the second pass remapping. We confirmed effective ribosomal RNA (rRNA)
   depletion per sample by aligning all paired-end reads to the complete
   sequences for nuclear and mitochondrial rRNAs using BWA-mem
   v0.7.17^[262]97 and ensured that the proportion of aligned reads did
   not suggest excessive rRNA contamination. Duplicate reads were marked
   using PicardTools v4.2.4.0 and post-alignment sequencing quality was
   assessed using RNA-SeQC v1.1.8 to ensure adequate library
   complexity^[263]98,[264]99. RNA sample identity was confirmed by
   comparing single nucleotide variant (SNV) calls from RNA sequencing to
   SNV calls from corresponding exome or genome sequencing data per
   sample.

Intron retention outlier analysis

   Fifty-three tissue-matched control samples from UDN participants
   unrelated to the proband, mother and father were selected for outlier
   analysis. IRFinder v1.2.456 was run in BAM mode using the same human
   reference GRCh37 and Gencode v19 annotations on aligned BAM files to
   measure the level of intron retention (i.e., “IRratio'') in MED11
   across the proband, mother, father, and control samples. The IRratio is
   computed per sample as (median read depth of first intron) / (number of
   reads spanning first and second exons + median read depth of first
   intron). Aligned reads covering the MED11 gene region (i.e.,
   chr17:4,634,723-4,636,903) from the proband, mother, father and two
   control samples were viewed using a local installation of the
   Integrative Genomics Viewer (IGV) v2.16.0^[265]100.

Pathway enrichment analysis

Phenotypically-similar patient groupings

   Phrank (version v2018-12-13) was used to compute all-against-all
   pairwise phenotype similarity scores between all affected patients'
   sets of standardized HPO terms. We normalized these scores by dividing
   by the maximum self-similarity score in each pair^[266]101. UDN
   patients experience a spectrum of symptoms across overlapping
   biological categories and therefore cannot be easily separated into
   distinct, well-defined clusters (Supplementary Fig. [267]17). We
   iteratively grouped similar patient pairs using complete-linkage
   hierarchical clustering with the agnes function from R's cluster
   package (version v2.1.4), which allows for patient groups of different
   sizes while minimizing the maximum distance between any two patients in
   the same cluster. We assigned patients to clusters by cutting the
   resulting dendrogram at height=3.5, resulting in 120 clusters of 2--97
   patients per cluster (mean=22, median=17).

Selecting genes per patient cluster

   We identify genes per patient cluster as follows. First, we consider
   known diagnoses for all UDN patients in that cluster. For patients with
   a diagnosis that was “complete'' (i.e., explained all symptoms
   including asserted phenotypes), no further genes are considered. For
   patients with no diagnosis or at most one “partial'' diagnosis, we then
   consider genes with an exonic (or intronic with a SpliceAI score >0.4)
   de novo variant and assign each gene its variant's severity weight (s)
   from our modified DeNovoWEST procedure. Recall that weights are
   assigned per variant class based on functional impact (e.g.,
   frameshift, nonsense, missense), variant deleteriousness, and gene
   constraint. Autosomal de novos are considered as before in addition to
   de novos on chromosome X with gnomAD population maximum allele
   frequency < 0.0001, TOPMed allele frequency <0.0001, proband alternate
   allele read depth >20 and frequency >0.2 (for females) and alternate
   read depth >20 in both parents. We consider the most significant gene
   per patient with
   [MATH:
   <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>s</mi></mrow><mo>)</mo></
   mrow><mo><</mo><mn>0.0005</mn> :MATH]
   , where
   [MATH:
   <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>s</mi></mrow><mo>)</mo></
   mrow><mo>=</mo><mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>S</mi><mo>
   ≥</mo><mi>s</mi><mo>∣</mo><mi>K</mi><mo>=</mo><mn>1</mn></mrow><mo>)</m
   o></mrow><mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>K</mi><mo>=</mo>
   <mn>1</mn></mrow><mo>)</mo></mrow> :MATH]
   .
   [MATH:
   <mi>P</mi><mi>r</mi><mrow><mo>(</mo><mrow><mi>S</mi><mo>≥</mo><mi>s</mi
   ><mo>∣</mo><mi>K</mi><mo>=</mo><mn>1</mn></mrow><mo>)</mo></mrow>
   :MATH]
   is computed exactly using precomputed per-variant Roulette mutation
   rates and variant weights per gene. Pr(K=1) is computed assuming
   mutations follow a Poisson distribution with
   [MATH:
   <mi>λ</mi><mo>=</mo><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mi>e<
   /mi><mi>n</mi><mi>e</mi></mrow></msub> :MATH]
   for genes falling on chromosome X in males and
   [MATH: <mi>λ</mi><mi mathvariant="normal">=
   2</mi><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi><mi>e</mi><mi>n</mi>
   <mi>e</mi></mrow></msub> :MATH]
   otherwise. Finally, in patients who still have fewer than two genes at
   this point and no complete diagnoses, we include up to one additional
   gene harboring a compound heterozygous variant pair that ranked in the
   top 100 in our RaMeDiES-IND cohort-wide per-individual analysis, as
   there was significant enrichment for correct diagnoses in this set
   (Fisher's exact test P = 5.23e-4). Across all patient clusters, we
   considered 70 genes (6.15%) with de novo variants corresponding to
   known diagnoses, 10 genes (0.88%) with compound heterozygous variants
   corresponding to known diagnoses, 562 (49.38%) other known diagnostic
   genes, 434 genes (38.14%) with new de novo candidates, and 62 genes
   (5.45%) with new compound heterozygous candidates.

Gene set enrichment analysis (GSEA)

   The genes found across all patients in each patient cluster were used
   as a query set for gene set enrichment analysis (GSEA) using g:Profiler
   (version e108_eg55_p17)^[268]102. We considered Reactome and KEGG
   biological pathway gene sets of size <150 genes and set our background
   gene set to all human genes annotated in Ensembl. Enrichment P values
   are adjusted using g:Profilerâ€™s g:SCS approach^[269]103. Briefly, for
   every query gene set size, 2000 random gene sets of the same size are
   used as queries for GSEA with the same parameters, and the lowest
   pathway enrichment P value is recorded for each random query set. A
   threshold t is selected for each query gene set size as the 5% quantile
   of these random minimum P values. Enrichment P values resulting from
   the true gene query are then adjusted by multiplying by 0.05/t.

Reporting summary

   Further information on research design is available in the [270]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [271]Supplementary Information^ (8.3MB, pdf)
   [272]41467_2025_61712_MOESM2_ESM.pdf^ (126.6KB, pdf)

   Description of Additional Supplementary Files
   [273]Supplementary Data 1-9^ (3.7MB, xlsx)
   [274]Reporting Summary^ (3.4MB, pdf)
   [275]Transparent Peer Review file^ (25.5KB, pdf)

Acknowledgements