Abstract Genomics for rare disease diagnosis has advanced at a rapid pace due to our ability to perform in-depth analyses on individual patients with ultra-rare diseases. The increasing sizes of ultra-rare disease cohorts internationally newly enables cohort-wide analyses for new discoveries, but well-calibrated statistical genetics approaches for jointly analyzing these patients are still under development. The Undiagnosed Diseases Network (UDN) brings multiple clinical, research and experimental centers under the same umbrella across the United States to facilitate and scale case-based diagnostic analyses. Here, we present the first joint analysis of whole genome sequencing data of UDN patients across the network. We introduce new, well-calibrated statistical methods for prioritizing disease genes with de novo recurrence and compound heterozygosity. We also detect pathways enriched with candidate and known diagnostic genes. Our computational analysis, coupled with a systematic clinical review, recapitulated known diagnoses and revealed new disease associations. We further release a software package, RaMeDiES, enabling automated cross-analysis of deidentified sequenced cohorts for new diagnostic and research discoveries. Gene-level findings and variant-level information across the cohort are available in a public-facing browser ([59]https://dbmi-bgm.github.io/udn-browser/). These results show that case-level diagnostic efforts should be supplemented by a joint genomic analysis across cohorts. Subject terms: Statistical methods, Diseases, Data integration __________________________________________________________________ Using well-calibrated statistical methods the authors jointly analyze Undiagnosed Diseases Network genomes, identifying known and novel disease genes. Software is publicly available to support future cross-cohort rare disease discovery efforts. Introduction For decades preceding the widespread application of DNA sequencing, identifying the genetic etiology of rare monogenic phenotypes including human diseases relied on segregation in pedigrees^[60]1. DNA sequencing enabled the analysis of sporadic cases with no segregation data^[61]2. Early studies analyzed small cohorts of phenotypically similar cases^[62]3,[63]4, a highly successful approach that is, however, limited to diseases with multiple known patients with fairly homogeneous presentations. In the absence of such phenotypically matched case cohorts, per-case studies of individual, undiagnosed patients are gaining popularity^[64]5–[65]8. By design, these studies cannot attain statistical power from the shared genotypes of unrelated patients and require extensive clinical and biological inquiry to prove the causal involvement of the genotype in disease^[66]9–[67]11. The most recent phase of human Mendelian genetics employs a data science approach to gene discovery propelled by the joint genomic analysis of phenotypically broad cohorts. Recent studies by the Deciphering Developmental Disorders and 100,000 Genomes consortia have demonstrated the power of this approach to identify new diagnoses and disease genes^[68]12–[69]14. This opens the prospect of international cross-cohort analyses, leveraging parallel efforts in many countries, and appreciating that rare diseases know no borders. Undiagnosed diseases network dataset Here, we apply existing and newly developed statistical genetics methods to the Undiagnosed Diseases Network (UDN) cohort that includes extremely difficult-to-solve, likely genetic cases (Fig. [70]1a–e). The unique, diagnostically elusive presentation is the only criterion for inclusion, and patients have varied presentations including neurological, musculoskeletal, immune, endocrine, cardiac, and other disorders. Symptom onset ranges from neonatal through late adulthood. In contrast to most existing rare disease cohorts, individuals accepted to the UDN have already undergone lengthy but ultimately unfruitful diagnostic odysseys prior to enrollment. These patients subsequently undergo extensive phenotypic characterization at UDN clinical sites^[71]15. Both broad Human Phenotype Ontology (HPO) terms and highly detailed clinical notes are collected and made available for all UDN researchers. Phenotypic information includes laboratory evaluations, dysmorphology examinations, specialist assessments, surgical records, and imaging (Fig. [72]1f). Fig. 1. Undiagnosed diseases network cohort analysis. [73]Fig. 1 [74]Open in a new tab a Map of clinical and research sites within the Undiagnosed Diseases Network (UDN) for evaluating patients and candidate variant functionality. Map created using R’s usmap package. b Genetic ancestry across the sequenced patient cohort. c Clinician-recorded primary symptom categories of patients. “Multiple” indicates 2+ categories could be considered primary and “other” indicates an unlisted category. Categories marked with an asterisk (*) are neurological subtypes (Supplementary Note [75]1). d Patient-reported age of first symptom onset. e Patient sex. f Categories and quantity of phenotype information collected for patients and made available to all UDN researchers. g Intronic variants detectable from genome sequencing (orange star) with a predicted splice-altering impact are considered alongside exonic variants in our statistical framework; these variants may result in retained introns or excised exons in processed transcripts. h We consider genes and gene pathways harboring de novo and compound heterozygous variants in sequenced trios (72% of all accepted cases). Other inheritance modes (e.g., homozygous, uniparental disomy) are not considered in our cohort-based framework. Complete case count by family structure (e.g., proband-only, duo) is in Supplementary Fig. [76]2. i Depiction of clinical framework to uniformly evaluate how well a patient’s phenotypes are concordant with a candidate gene or variant. All icons in (f) and (i) are from Microsoft PowerPoint. There is a similar emphasis on collecting sequencing data, with whole genomes sequenced for probands and their immediate or otherwise relevant family members. Although smaller than some other rare disease cohorts^[77]14, the UDN—with a design bridging clinical, research and functional validation teams and a focus on extreme patient presentations—was thought to be optimized for case-based analyses where probands are evaluated individually in depth. Patients’ detailed phenotypic information, ongoing confirmation of new diagnoses, and the potential enrichment for novel genetic disorders make for an ideal data space to validate and develop statistical approaches. We harmonized and jointly called single nucleotide (SNV) and insertion/deletion (indel) variants across 4,236 individuals with whole genome sequencing in the UDN dataset and additionally called de novo mutations from aligned reads across complete trios (Methods, Supplementary Fig. [78]1)^[79]16. Clinical evaluation of computational findings Here, we generate candidate gene–patient matches via a series of statistical genomic analyses implemented in our software suite, Rare Mendelian Disease Enrichment Statistics (RaMeDiES, Fig. [80]1g, h). We focus on the model of monogenic, autosomal inheritance in de novo and compound heterozygous cases to prioritize candidates via a genotype-first approach, with no clinical input or phenotypic information used. Each candidate is then evaluated with respect to the patient’s clinical presentation and the gene and variant’s putative role in disease—based on known disease associations, functionality in model organisms, tissue expression, molecular function, evolutionary constraint, and in silico predicted pathogenicity—to assess phenotypic match (Fig. [81]1i). For genes or gene pathways harboring deleterious variants across multiple individuals, phenotypic similarity between patients is also assessed. To scale clinical evaluation to the cohort level, we developed a semi-quantitative protocol guided by the ClinGen framework^[82]17 that uses hierarchical decision models to increase efficiency and enables consistent and comparable evaluations of a gene–patient diagnostic fit by independent experts (Supplementary Note [83]2, Supplementary Fig. [84]3). We calibrated the protocol during development by testing whether the resulting clinical scores assigned by different experts on the clinical team were in agreement. We validated the protocol in a blind test using non-causative candidate genes as controls. Specifically, non-causative genes were selected with identical criteria to true candidate genes, except biallelic variants were in cis rather than in trans or had low predicted pathogenicity scores. The clinical team applying the protocol consistently scored true candidate genes higher than control genes (Wilcoxon one-sided rank-sum P value = 0.0171, Methods, Supplementary Data [85]1), suggesting that the scores generated by the clinicians’ protocol may be used to prioritize candidates. Results De novo analysis Several highly penetrant, extreme phenotypic presentations underlying Mendelian and other congenital, complex human diseases have been linked to de novo mutations^[86]12,[87]18,[88]19. We began by evaluating all independent, sporadic trios with complete sequencing data for de novo mutation etiologies. We detected 78.3 de novo point mutations and 9.5 de novo indels on average per proband genome concordant with the expectation^[89]20. Mutation count showed expected dependency on parental ages with Poisson-distributed adjusted counts, attesting to the quality of de novo calling (Fig. [90]2a, and Supplementary Data [91]2, Supplementary Fig. [92]4). Fig. 2. De novo recurrence. [93]Fig. 2 [94]Open in a new tab a De novo mutation counts per proband adjusted for parental ages. Blue vertical lines show the mean values of the distributions, and curves represent the Poisson fits. b Schematic of analytical test for the recurrence of de novos that considers distal splice-altering and exonic SNV and indel variants, their variant functionality scores, a genome-wide mutation rate model Roulette, and per-gene GeneBayes constraint values. “Like” variants refer to those of the same variant class (i.e., coding SNVs [CS], coding indels [CI], intronic SNVs [IS], intronic indels [II]) and within the same functionality score and minor allele frequency thresholds. c Genes with highest significance values for de novo recurrence across the cohort, computed as described in b3 (Pr’( y[g] )) and b4 (Q[g]), when focusing on missense variants with AlphaMissense and PrimateAI-3D scores; patients are represented as colored circles. Complete gene list with exact P values can be found in Supplementary Data [95]3. Note that multiple testing corrections have been applied in the form of both Bonferroni (**) and FDR (*) thresholds. d AlphaFold-predicted human LRRC7 protein structure (AF-[96]Q96NW7-F1) covering the leucine-rich repeat region with high predicted structural confidence (amino acid positions 86-463). The fifth and eighth LRR domains where missense de novos were found are highlighted in blue. Reference alleles for missense de novo variants observed across two UDN patients (red) are shown in circles. A depiction of LRRC7’s linear protein sequence (Ensembl ID ENSP00000498937) with InterPro predicted domains shown in colored boxes is below. e Overlap of phenotype terms annotated to each patient. We then sought to identify genes enriched for deleterious de novo mutations across our patient cohort. Although the UDN dataset contains sequenced sporadic cases and their unaffected parents, classic case-control associated burden tests are underpowered in the de novo regime because of the extreme rarity of these variants, difficulty in their identification without complete sequenced trios, and noise in variant sampling from both cases and controls^[97]21. Instead, de novo goodness-of-fit enrichment tests, which evaluate whether observed de novo events exceed what would be expected by chance based on mutation rates, are better suited for studying diseases with ultra-rare pathogenic variants^[98]18,[99]19. The power of this de novo enrichment calculation increases with better models of underlying mutation rates and estimates of variant deleteriousness. Recently, the rate of de novo emergence has been estimated at basepair resolution with a high degree of accuracy^[100]22. Newly developed deep learning models for predicting the pathogenicity of de novo and other variants also now exhibit unprecedented accuracy in distinguishing disease-relevant variants^[101]23,[102]24. We leverage these recent advances to build an accurate, unbiased statistical procedure called RaMeDiES-DN to detect genes enriched for deleterious de novos. Unlike the earliest generation of de novo recurrence approaches which leveraged Poisson approximations for runtime efficiency but could not take advantage of improved deleteriousness scores and mutation rate models^[103]18, RaMeDiES methods seamlessly incorporate per-variant deleteriousness scores and mutation rates without sacrificing runtime. Briefly, for a given observed variant in a gene, we define its “mutational target” as the sum of per-variant de novo mutation rates for all possible variants with as high or higher a deleteriousness score. By construction, this per-variant mutational target is expected to be a uniformly distributed statistic, enabling us to leverage closed-form analytical formulas for computing their cumulative probabilities (Supplementary Note [104]3). Our framework naturally combines different variant types including SNVs and indels with a distinct mutation rate model, and can interchangeably utilize various deleteriousness scores (Fig. [105]2b, Methods). Although current state-of-the-art de novo recurrence approaches also incorporate relevant variant-level information, they rely on a complex, permutation procedure^[106]12. RaMeDiES’ analytical approach eliminates the need for permutation-based significance calculations and can process large datasets in mere seconds while maintaining well-calibrated P values (Supplementary Fig. [107]5). We first focus on the subset of missense variants, which comprise a sizable proportion of known Mendelian disease-causing variants and for which new, specialized pathogenicity predictions exist (e.g., PrimateAI-3D and AlphaMissense)^[108]23–[109]25. We find one significant gene, KIF21A, corresponding to the correct, complete diagnosis in one patient and a strong partial diagnosis in one other (Bonferroni-adjusted Cauchy-combined P value < 0.05, Fig. [110]2c). Notably, disease genes with a de novo mode of inheritance are expected to be under strong selection against heterozygous loss-of-function variants. We further refine our method to incorporate this intuition by prioritizing genes by their GeneBayes values, which indicate selection against heterozygous protein-truncating variants, using a weighted false discovery rate (FDR) procedure^[111]26–[112]28. With this correction, we obtain three gene findings at an equivalent significance threshold (Q-value < 3e-6) and eight gene findings at FDR 5% (Supplementary Data [113]3). Our second and third gene hits, BAP1 and RHOA, correspond to a known correct diagnosis in one patient and strong clinical matches in two other patients. Among the five remaining genes at FDR 5%, three genes (CACNA1C, COL4A1 and NOTCH1) correspond to known diagnoses in five patients and the top clinical candidate in one patient. Two impacted patients with de novo missense variants in the leucine-rich repeat region of LRRC7, a gene expressed in the brain, had phenotypic overlap of hypotonia and developmental delay; one patient additionally experienced nystagmus, staring spells, and balance problems and the second had ataxic gait (Fig. [114]2d–e). We discussed these findings with the managing clinical teams overseeing these cases, and these UDN patients have now been included in a clinical case series linking LRRC7 to an emerging neurodevelopmental disorder^[115]13,[116]29. Another gene, NRBP1, remains a strong candidate in two patients due to their neurological phenotype overlap and NRBP1’s expression in the brain. An initial functional study in fly through the UDN Model Organisms Screening Core was inconclusive. This gene has been submitted to MatchMaker Exchange. We next consider all exonic variants, including nonsense variants and indels, and further incorporate additional well-established deleteriousness predictors, CADD and REVEL^[117]30,[118]31. Different mutagenesis processes lead to indel mutations, so SNV mutation rate models can be inappropriate for modeling this mutation type for some genes^[119]32. We therefore constructed a separate per-gene mutation rate approximation for indels (see Methods for details). When we reran RaMeDiES-DN on all exonic variants using four deleteriousness predictors, we additionally identified KMT2B (Bonferroni-adjusted Cauchy-combined p-value < 0.05), corresponding to a correct diagnosis in four patients due to de novo indel variants (Supplementary Data [120]4, Supplementary Fig. [121]6a). The next seven gene findings at FDR 5% had already been identified when assessing recurrence of missense variants. At FDR 10%, we identify five new putative diagnoses. For instance, two patients had high impact missense de novo variants impacting H4C5, a histone gene that was not detected with significance in our missense-only enrichment test due to its lack of precomputed AlphaMissense scores. Both patients had infantile-onset gross motor developmental delays, dysmorphic facial features, and speech difficulties (Supplementary Fig. [122]6b,c). These and other phenotypes exhibited by each patient were recently found to be linked to missense variants in histone H4 genes^[123]33. For one of the patients, the de novo variant was contemporaneously interpreted by UDN clinical experts to be causal^[124]34. The second patient’s de novo variant has now been reclassified as “pathogenic” and resulted in a new diagnosis for this participant. Two other patients with sporadic neurodevelopmental delay each harbor truncating de novo variants in ZNF865. Both patients have phenotypic overlap with a series of 10+ other patients with ZNF865 mutations, which makes a compelling case for pathogenicity^[125]35. Subsequent to the publication of the case series, we anticipate this gene–disease relationship will be established as causal and both variants to be reclassified as likely pathogenic. Inclusion of deep intronic splice variants Next, we demonstrate how RaMeDiES-DN can be extended to additionally consider non-exonic variants uncovered uniquely from whole genome sequencing using the same methodological infrastructure. On the one hand, it remains challenging to identify non-coding regulatory variants involved in rare Mendelian diseases^[126]36, and the overall role of such variants in congenital disorders is still a subject of debate^[127]37. On the other hand, distal gain-of-splice site mutations creating new acceptor or donor splicing sites deep in the intronic sequences of genes are now a well-recognized cause of monogenic disease^[128]38. Identification of splice-altering variants directly from genome sequencing data is recently possible using newly-developed in silico predictive scores without relying on RNA sequencing. RNA sequencing has limitations for diagnosis because it depends on the availability of relevant tissue material that is especially challenging to obtain for neurodevelopmental patients, and it may miss lowly-expressed isoforms and those targeted by nonsense mediated decay^[129]39. Moreover, identifying disease-causal intronic splice variants is especially appealing due to their potential targetability using antisense oligonucleotide therapies^[130]40. Unlike functional predictions for exonic variants, which have been extensively validated for consistency and accuracy via decades of experimental in vitro and in vivo studies, functional predictions of splice-altering intronic variants are relatively new and still require experimental confirmation. We used a combined computational–experimental approach to prioritize distal splice variants using in silico predicted scores and an in vitro massively parallel splicing reporter assay (Methods, Supplementary Fig. [131]7)^[132]41,[133]42. We found the per-variant in silico predictions to be mostly concordant with the in vitro assay readouts. Variants assigned higher in silico scores are more frequently supported by the experimental, in vitro assay, and those with relatively lower in silico scores (SpliceAI <0.5) have a non-negligible validation rate as well (Supplementary Fig. [134]8). This prompted us to incorporate the full range of continuous SpliceAI scores, disregarding only the lowest scoring variants, in our statistics. We found this approach to consider distal splice-site variants attractive because it lends itself to a statistical analysis alongside exonic variants. Once genome-wide functionality score tracks are released for the next generation of splice predictors as well (e.g., Pangolin)^[135]43, they can be integrated into RaMeDiES using the same methodology leveraged for exonic variant predictors. No new candidate genes with a significant recurrence of intronic de novos were found in the UDN dataset. However, by seamlessly incorporating non-exonic variants within the same statistical test, our approach enables a more complete, automated analysis of the growing volume of whole genome sequencing data across rare disease consortia. We also ran the state-of-the-art de novo enrichment approach, DeNovoWEST^[136]12. Unlike our approach, DeNovoWEST incorporates a gain-of-function model alongside a loss-of-function model, which has the potential to yield additional findings. We equipped the DeNovoWEST algorithm with the Roulette mutation rate model, up-to-date CADD variant deleteriousness and s[het] gene constraint scores^[137]27, and further incorporated deep intronic variants with predicted splice-altering impact (Supplementary Fig. [138]9). This approach yielded two Bonferroni-significant genes when applied to the UDN dataset, one of which was also uncovered by RaMeDiES-DN at Bonferroni significance and the second at a FDR of 6% (KMT2B and H4C5, Supplementary Fig. [139]10). We did not apply an FDR-based approach to DeNovoWEST’s results to consider additional gene findings, because DeNovoWEST P values are a construct over three sometimes dependent tests, rendering an FDR adjustment inappropriate. We also find CSMD1, a highly indel-prone gene, within DeNovoWEST’s top-ranked five genes, likely because indels and SNVs are not distinguished in the mutation rate model^[140]44. We then ran DeNovoWEST and RaMeDiES-DN on a larger cohort of previously published and jointly-analyzed de novo variants from the Deciphering Developmental Diseases (DDD) study, GeneDx, and Radboud University Medical Center (RUMC). Both methods recovered known autosomal dominant disease genes at a comparable rate (Supplementary Fig. [141]16, and Supplementary Data [142]5), but RaMeDiES-DN required less than one-tenth the runtime. As opposed to methods like DeNovoWEST that require individual-level data, RaMeDiES’ operation at the level of mutational targets enables sharing of intermediate statistics across cohorts without revealing patients’ individual variants. The rare disease field has accumulated many individually small, but collectively large, cohorts that have never been included in any joint analyses because data governance restricts external sharing of variant- and patient-level data. As a proof of concept, we ran RaMeDiES-DN on the individual DDD, GeneDx, and RUMC de novo cohorts and subsequently combined the three summary-level mutational target statistics to generate gene findings. This meta-analysis run resulted in the same per-gene significance values as compared to RaMeDiES-DN’s run on individual variants from the combined cohort (Supplementary Data [143]6). Compound heterozygous variant analysis We next evaluate compound heterozygous (comphet) variants, which are the most likely cause of rare recessive disorders in populations with low degrees of consanguinity, as is largely the case in the United States^[144]45. Comphet variants are defined as a pair of alleles landing within the same gene and inherited in trans from unaffected parents who are also heterozygous at these loci. These inherited disease-causing variants tend to be rare in the population, due to the effect of selection against biallelic variant occurrences or against slightly deleterious phenotypes of heterozygous variants^[145]46. Despite the expected low frequency of individual alleles comprising a comphet pair, directly selecting for highly deleterious comphet variants still results in numerous false positive findings at the cohort level, motivating a statistical approach for cohort-level comphet prioritization. Developing a statistical framework analogous to de novo goodness-of-fit enrichment requires modeling the distribution of rare inherited alleles per individual. De novo mutations arise through the universal process of mutagenesis and are therefore straightforward to model. Similarly, the distribution of the total number of all derived alleles per haploid genome (i.e., all non-ancestral variants inherited from one parent without any imposed frequency constraints) are also not dependent on the demographic history of the population and therefore are straightforward to model^[146]47,[147]48. In contrast, however, the distribution of the total number of rare alleles per individual is highly dependent on population structure, which is notoriously difficult to account for. Some previous approaches for determining cohort-level significance of comphet variants ignore population structure when modeling the number of rare variants. Although this may be an accurate statistical test in controlled model organism cross experiments, it is inappropriate for natural human populations, where population structure is present even at a very fine scale^[148]49. In the Genome of the Netherlands (GoNL) dataset for instance, the number of synonymous singletons across unrelated individuals still reflects geographic structure along a south-north cline^[149]50. In our framework, we sidestep directly modeling the distribution of rare variant counts per individual and instead condition on the observed number of rare variants inherited from each parent using trio-level data. Given the number of rare variants inherited from each parent per individual, we then compute the probabilities of comphet variants landing in high-scoring positions in the same gene across the cohort. Although the positions where inherited variants land is influenced in part by direct and background selection and biased gene conversion, for very rare variants, the effect of these factors is negligible compared to the effect of the variation in mutation rate along the genome and the overall gene target size^[150]22,[151]51. We therefore model the positional distribution of rare inherited variants using the same Roulette basepair-resolution de novo mutation rate model leveraged in our de novo recurrence model. Our comphet recurrence model, called RaMeDiES-CH, relies on the comphet mutational target, computed for each comphet variant pair and defined similarly as the de novo mutational target previously introduced. Specifically, the comphet mutational target is computed as the total squared mutation rate of all possible variants with higher functionality scores (Fig. [152]3a). RaMeDiES-CH applies the Cauchy P value combination approach as before to leverage multiple variant-level functionality scores while considering exonic and intronic variants, but does not incorporate gene constraint scores, which do not exist for recessive selection (Methods, Supplementary Fig. [153]11)^[154]52. RaMeDiES-CH computes well-calibrated per-gene P values for comphet variants in a cohort (Supplementary Fig. [155]5). Fig. 3. Compound heterozygous variants. [156]Fig. 3 [157]Open in a new tab a Illustration of the unnormalized squared mutational target computed for each observed comphet variant in a gene across the cohort (RaMeDiES-CH, Supplementary Fig. [158]11) or in an individual across the genome (RaMeDiES-IND, Supplementary Fig. [159]12). “Like” variants refer to those of the same variant class (i.e., coding SNVs [CS], coding indels [CI], intronic SNVs [IS], intronic indels [II]) and within the same functionality score and minor allele frequency thresholds. b Top ranked genes resulting in the best enrichment statistic computed for RaMeDiES-IND. Putative candidates refer to genes that remain candidates for pathogenicity due to their phenotypically-relevant tissue expression, but where there is not enough functional evidence or published gene–disease relationships to establish causality at this time. c Overlap between phenotypes associated with MED11 and those exhibited by the affected patient. d RNA-Seq reads from whole blood samples aligned to first two exons and first intron of MED11 for proband (black), dad (blue), mom (purple) and two tissue-matched control samples (gray). Thin green line represents the intron, solid boxes represent protein-coding exonic regions, and the dotted box represents the 5’ untranslated region of MED11. (e) Proband exhibits significant retention of the first intron relative to parents and fifty-three tissue-matched control samples. Intron retention ratio is calculated as the (median read depth of first intron) / (number of reads spanning first and second exons + median read depth of first intron). Across the set of non-consanguineous UDN families, we did not find significant recurrent comphet occurrences across genes. This result is unsurprising, as previous estimates suggest that in panmictic disease populations, only one deleterious comphet variant is expected for every five dominant de novos^[160]49. Nevertheless, RaMeDiES-CH represents an accurate and unbiased statistical test for the recurrence of comphet variants in human populations, which can be applied to reveal new diagnoses as sequenced rare disease datasets expand. We suspected that singleton disease-causing comphet variants were still present in the cohort. We adapted our statistical framework to compute an individual-based statistic, RaMeDiES-IND, that normalizes each observed comphet variant mutational target across all genes in the genome rather than across all individuals in a cohort (Supplementary Fig. [161]12). This approach yielded a ranked list of patient–gene pairs across the UDN cohort, where each patient–gene pair could be annotated as corresponding to a correct diagnosis or otherwise (Supplementary Data [162]7). We computed a single enrichment statistic for this overall patient–gene ranking, which simultaneously suggested a threshold for clinical consideration of findings, as the best Fisher’s exact test P achieved across all positions in the list. This enrichment statistic was significant when compared to the distribution of the same statistic computed across 10,000 random shuffles of the patient–gene list (permutation P value = 0.001, Methods, Supplementary Fig. [163]13). Among the top thirteen hits yielding this best enrichment statistic, we recapitulated five known diagnoses (i.e., NEUROG3, PAH, COX20, NDUFAF8, PRDX3)^[164]53,[165]54 and newly identified the genomic cause of a known biochemical diagnosis (i.e., ACADM in a patient with MCAD deficiency). We also identified comphet variants in MED11 which are now leading diagnostic candidates in an undiagnosed patient experiencing neurodegeneration, developmental delay, brain abnormalities, chorea, and hypotonia (Fig. [166]3c). MED11 is associated with epilepsy and intellectual disability, and this patient’s presentation could represent a phenotypic expansion of this known disorder^[167]55. Both inherited variants occur deep in the first intron of MED11, a region that would be missed by exome-only sequencing or analysis, and are predicted to cause cryptic splice donor gains. Transcriptome (RNA) sequencing of blood samples from the affected patient and both parents highlighted a significantly higher rate of first intron retention in the affected patient relative to both parents and to fifty unrelated blood control samples (Fig. [168]3d–e, and Supplementary Fig. [169]14)^[170]56. Our comphet models assume that the two inherited alleles comprising a biallelic variant pair arose as a result of independent mutations. Homozygous recessive variants, where the exact same allele is inherited from each parent, are considered alongside compound heterozygous variant configurations in cases where the allele independence assumption is unlikely to be violated. In individuals with consanguineous parentage, however, the same heterozygous variant in both related parents most likely arose from a single, ancestral mutational event. As such, we directly exclude probands with evidence of parental relatedness from our analysis (Supplementary Note [171]4)^[172]49. Pathway analysis Genes involved in the same pathway are frequently involved in similar phenotypic presentations^[173]57–[174]60. This provides an enticing possibility of drawing statistical power from multiple independent occurrences of deleterious variants in the same functional units, rather than just in the same genes. Moreover, therapeutics for disorders of the same functional unit that are individually too rare to meet minimal participant requirements for clinical trials may be evaluated together within the same umbrella or basket trial for more efficient approval^[175]61. However, such an approach should be pursued with caution, as the phenotypes stemming from perturbations of different genes in the same functional unit may vary to a great extent. Such differences in patient presentations may render the clinical evaluations and therapeutic potential of statistically significant findings virtually impossible. To mitigate this issue, we first initially consider groups of patients with similar phenotypes, and then within each of these groups, assess the overrepresentation of deleterious mutations across established biological pathways (Fig. [176]4a). Fig. 4. Biological pathways enriched within phenotypically-similar patient subgroups. [177]Fig. 4 [178]Open in a new tab a Schematic illustrating the two-step process of first clustering patients according to the semantic similarity of their phenotype terms and second finding enriched biological pathways among the genes within each patient cluster. b The most significant pathways per cluster (P value < 0.01, adjusted for multiple testing using g:Profiler’s Statistical Correction Scheme^[179]103) with 1+ genes from 1+ undiagnosed patients; complete list in Supplementary Data [180]8. c Two patients with primarily immune-related symptoms each harbored a compelling de novo variant in genes involved in immunoproteasome assembly (POMP) and structure (PSMB8). Their symptoms strongly overlap, and a subset of these symptoms were also known to be associated with either gene in OMIM. d Three neurological patients had variants in transmembrane genes involved in the same pathway. These patients had substantial phenotypic overlap with each other, as expected, and with the phenotypes associated with each of their genes (depicted as star shapes in the upset plot). All icons in (a), (c) and (d) are from Microsoft PowerPoint. We start by clustering 2662 affected patients—with or without sequencing data—into 120 groups (median = 17, min = 2, max = 97 patients per cluster) based on the semantic similarity of their phenotype terms. Within each cluster, we then combine our de novo candidates, compound heterozygous candidates and known UDN diagnoses and perform gene set enrichment analysis (Methods, Supplementary Data [181]8). We focus our attention on undiagnosed cases with de novo or compound heterozygous candidates within enriched pathways in each cluster (Fig. [182]4b). We also report all enriched pathways including those with only diagnosed patients for potential therapeutic grouping (Supplementary Data [183]9). Two of three total candidate genes in one cluster with 19 immunological disorder patients are both involved in the immunoproteasome complex (KEGG:03050, n = 46, adjusted P value = 4.42e-3). One patient’s genome contained a known diagnostic, de novo frameshift variant in POMP, an immunoproteasome chaperone protein^[184]62. An undiagnosed patient with evidence of chronic inflammation, recurrent infections, and skin lesions had a missense de novo in PSMB8, a component of the immunoproteasome β-ring with overlapping phenotypic associations (OMIM:256040). Both patients had similar combined immunodeficiency beyond what was captured in their standardized phenotype terms, including decreased global antibodies, decreased B cells and natural killer cells, and retained T cell functionality (Fig. [185]4c). Disruptions to immunoproteasome assembly and structure have been shown to lead to an accumulation of precursor intermediates, impaired proteolytic activity and subsequent uncontrolled inflammation^[186]63. In another cluster of 15 similarly presenting neurological patients, three candidate transmembrane genes were represented in the same functional pathway named for some genes’ known involvement in taste transduction (KEGG:04742, n = 85, adjusted P value = 7.45e-3). Two of these genes, CACNA1C and GABRA3, harbored high impact de novo and hemizygous missense variants respectively, corresponding to known patient diagnoses^[187]64,[188]65. The genome of another, undiagnosed, now deceased patient from this cluster with no prior candidate variants contained a synonymous de novo variant predicted to alter splicing in another gene in the same functional pathway, HCN4 (Fig. [189]4d). All three patients exhibited seizures at a young age, speech delays, severe hypotonia, spasticity and visual impairment. Mouse knockouts of HCN4 demonstrate neurological phenotypes^[190]66,[191]67. In humans, HCN4 is expressed in the visual and nervous systems and has recently been associated with infantile epilepsy, suggesting that this patient’s undiagnosed disorder plausibly represents a phenotypic expansion of this gene^[192]66,[193]67. Discussion In total, we analyze 886 sporadic or suspected recessive cases with complete trio or quad genome sequencing alongside an additional 463 phenotyped, diagnosed individuals using computational methods to identify de novo recurrence, compound heterozygosity, and pathway enrichment. We establish five new diagnoses and three new putative diagnoses in known disease-causing genes or genes previously unlinked to these patients’ exact presentations. Our prioritization framework for pathway analysis further recapitulates 70 known de novo and 10 known comphet diagnoses and suggests 82 de novo and eight comphet candidates for follow-up (Methods, Supplementary Data [194]8). In the field of common disease genetics, statistical inference of disease-associated genomic loci is confidently regarded as primary evidence for their causality. Rare disease genetics, in contrast, is in a transition state. Due to a lack of large disease-matched cohorts, case-based analyses of individual probands relying heavily on detailed patient phenotyping and clinical intuition have typically been used to generate candidate variant hypotheses. Evidence required to shift these variants from uncertain significance to known pathogenic status comes from experimental, functional studies and by identifying additional, unrelated, genotype-matched individuals with similar phenotypes through variant matchmaking services such as MatchMaker Exchange^[195]68,[196]69. Recently, analyses of large, broadly-phenotyped cohorts of rare disease patients have demonstrated the potential for statistical approaches to reveal diagnoses and generate new gene discoveries in the rare disease space as well^[197]12,[198]14,[199]70. Although the genome is a big place, it is also a finite space with respect to gene regions impacted by simple variants such as SNVs and short (<10 basepairs) indels. This suggests that, in theory, recurrence-based statistical methods applied to sufficiently large sequenced cohorts of rare disease patients, even those with diverse phenotypic presentations like the UDN, will enable the eventual discovery of all causes of prenatally viable monogenic disease stemming from these variant types. In order to take statistical discoveries as primary evidence, as is the case for common diseases, we need accurate, well-calibrated statistical methods^[200]71. Even slight model misspecification may propagate and exacerbate the rate of false discoveries. The rapid growth of genomic datasets on which these models may be applied, coupled with an ongoing difficulty in phenotyping patients at scale to confirm findings^[201]72, further increases the urgency for more rigorous models. Here we show that well-calibrated statistical models can be built for both de novo and compound heterozygous modes of inheritance. Although novel disease–gene discovery from large, phenotypically- and genetically-homogenous cohorts has been demonstrated, we show here that rigorous analysis of a diverse, moderately-sized disease cohort at the gene and the pathway level shows promise. Extending our RaMeDiES statistical framework to consider variant types with less accurate mutation rate models or pathogenicity predictions may lead to a sacrifice of precision. We therefore acknowledge the limitations of our statistical approach for comprehensive rare disease diagnosis efforts using short-read sequencing data, and suggest that our cohort-level models be applied alongside existing case-based variant prioritization pipelines for additional statistically-meaningful findings. First, although our models integrate non-coding variants with predicted splice-altering impacts, they do not consider potentially functional variants within whole genome data that fall into untranslated gene regions, RNA-coding genes or between genes, as genome-wide tracks of verifiable deleteriousness scores do not yet exist for these variant types. Improvements to and precomputed scores for these variants will be beneficial for interpretation efforts in general and can be leveraged in future iterations of RaMeDiES. Our statistical analysis also does not consider structural, large indel, copy number, or tandem repeat variants, as their identification from short-read sequencing data is computationally expensive and often inaccurate. Ongoing efforts to generate long-read sequencing data within the UDN and elsewhere should enable improved identification and analysis of pathogenic complex variants^[202]73,[203]74. Developing a statistical model for these variants will still require accurate mutation rate estimates for these variant types, which is lacking. GnomAD-SV represents a promising iteration toward this goal, but is still highly dependent on their specific variant calling pipeline and data rather than biological mutagenic processes^[204]75. The presented method considers only autosomal de novo and compound heterozygous inheritance patterns due to complications in modeling other disease-relevant inheritance patterns. First, it is difficult to propose a statistical model for biallelic variant counts in consanguineous and founder populations, because these counts strongly depend on the ancestral population history and inbreeding patterns. A more appropriate statistical approach for assessing recurrence of these variants would be the extension of parametric linkage applied to very large cohorts^[205]76. Second, inclusion of hemizygous or other X-chromosome variants requires accurate sex-chromosome variant calling, which is notoriously error prone, as well as an accurate mutational model of the X chromosome, which is complicated due to sex-dependent selection and random X-inactivation^[206]77. Finally, although we do not model parental mosaicism or uniparental disomy in our recurrence statistics, these inheritance patterns and events are regularly assessed via complementary, traditional case-based approaches^[207]10. Even though genomic sequencing has been liberalized, currently many analyses are still restricted to individual programs, and regulatory and technical barriers prevent sharing individual-level variant data broadly. In contrast, there are avenues for sharing some variant-level data in a way that is easily accessible to clinical geneticists. MatchMaker Exchange, for instance, enables the sharing of specific variants prioritized through case-based analyses with the goal of finding new genotype- and phenotype-matched patients. Broadening the success of MatchMaker Exchange to include variants that may not have risen to the level of strong candidates in analyses of individual patients is desirable. We developed a browser containing our gene-level findings and variant-level information about rare genetic variation in UDN patients ([208]https://dbmi-bgm.github.io/udn-browser/). In addition, we provide an open-source software package, RaMeDiES, implementing the efficient and well-calibrated statistics for de novo recurrence and deleterious compound heterozygous inference proposed here. RaMeDiES’ operation on shareable summary statistics rather than on variant-level data enables automated, deidentified cross-analysis of substantial existing yet siloed sequenced cohorts for new diagnostic discoveries. As the Mendelian genomics field continues the transition to this new data science phase, the methods we present here should facilitate the exciting prospect of international cross-cohort analyses, resulting in new findings and a vastly improved rare disease diagnostic rate globally. Methods This work was performed in accordance with all ethical guidelines outlined in the NIH IRB #15HG0130. The study proposal and manuscript were approved by the UDN Publications and Research Committee. All participants have provided written, informed consent for the sharing and use of their data in this study. Undiagnosed diseases network (UDN) structure The Undiagnosed Diseases Network (UDN) was established in 2014 with the goal of uncovering clinical diagnoses and novel disease-causing genetic variants and their molecular functionalities. In its current phase, the UDN is composed of 12 clinical research centers across the United States and a CLIA-certified sequencing core at Baylor Genetics. Typical UDN patients have already endured a multiyear “diagnostic odyssey” of extensive prior testing by multiple medical specialists and often inconclusive targeted, whole exome and even whole genome sequencing at the time of their application to the UDN. As part of the application process, a team of clinicians and genetic counselors at one of the UDN clinical sites reviews the patient’s medical records, referral letters and lab data and creates an abstracted case review document. If the team concludes that a UDN evaluation may aid in the identification of a diagnosis, the patient is accepted to the program and undergoes a thorough in-person evaluation at their assigned clinical site. Most patients and available affected and unaffected family members receive whole genome sequencing (GS) as well. All genomic sequencing data, clinical sequencing reports prepared in accordance with the American College of Medical Genetics and Genomics (ACMG) variant classification guidelines, structured phenotyping in the form of Human Phenotype Ontology (HPO) terms, lab results, imaging data, medication data, referral letters and clinical notes, the abstracted case review document, and candidate variants and genes are uploaded to the UDN Data Management and Coordinating Center. All patients enrolled in the UDN have consented to the broad sharing of all their genomic, phenotypic and clinical data with researchers network-wide for use in research projects and when evaluating gene–phenotype fit for a specific patient and candidate gene. Moreover, UDN patients have consented to follow-up if additional tests or information are deemed useful. Harmonization of whole genome sequencing data Short-read whole genome sequencing was performed between 2014 and 2022 in accordance with the UDN Manual of Operations, which specifies that the average coverage across the genome must be >40X, and >97.5% of all coding and noncoding genes (UTRs, coding regions, and intronic regions) must be covered at >20X. Paired-end FASTQs were retrieved in June 2022 for 4268 samples collected from 4236 unique individuals. Six individuals subsequently dropped out of the UDN program and are excluded from the analyses presented here. All FASTQ pairs were within expected parameters ([209]https://www.bioinformatics.babraham.ac.uk/projects/fastqc/, version 08-01-19) and were aligned to human reference hg38 (with decoys and all alt contigs) using the Sentieon^[210]78 bwa-mem implementation via the Clinical Genome Analysis Pipeline (CGAP, [211]https://cgap.hms.harvard.edu/, version 29cefcec). Read groups were added via a custom CGAP script, multiple FASTQ pairs corresponding to the same sample were merged, and resulting BAMs were sorted, deduped, and recalibrated using a Sentieon implementation. GVCFs were produced using CGAP’s implementation of GATK’s HaplotypeCaller. All processing steps from FASTQ to GVCF were deployed on spot instances in Amazon Web Services (AWS). GVCFs were then egressed to the Harvard Medical School institutional cluster. SNVs/indels were jointly called across genomic shards then merged using Sentieon tools (version 202112.02). Per-sample sex and cross-sample relatedness were confirmed using Somalier (version 0.2.15, Supplementary Fig. [212]1)^[213]79. We required that all trios under consideration in our analysis had two parents reported as “unaffected”, a child reported as “affected”, parent–child relatedness 0.5 ± 0.075, parent–parent relatedness <0.15, mothers had heterozygous variants present and a scaled mean depth of ~2 on chromosome X, and fathers had a scaled mean depth of ~1 on chromosome Y. All variants were annotated using Ensembl VEP (version 108) and slivar (version 0.2.7) for TOPMed and per-population gnomAD (versions v2.1.1 and v3.1.2) variant frequencies and homozygote counts^[214]80,[215]81. For our compound heterozygote analysis, we inferred within-family regions of identity by descent (IBD) using KING (version 2.3.0)^[216]82. We required at least one IBD region between the child and each parent to further confirm relatedness (in addition to kinship coefficient filtering) and no IBD regions of length > 3 Mb between parents to confirm non-consanguinity between parents. In families with multiple affected siblings, we select one sibling as the proband and disregard the other siblings during initial analyses. Variants in other affected siblings were then used to check segregation during validation of our findings. This process resulted in 846 non-consanguineous trios with an affected child and two unaffected parents for our analyses. We chose to stringently exclude individuals with evidence of familial consanguinity (i.e., by imposing a parental relatedness and IBD region length constraints) rather than excluding patients based on their relative recessive burden because an assumption of our statistical models is violated in consanguineous cases (Supplementary Note [217]4). Clinical evaluation framework Protocol overview We developed a clinical analysis protocol to reduce subjectivity in the assessment of diagnostic candidates. We used the case evaluation process implemented at Brigham Genomic Medicine as a foundation^[218]83. We then transformed this process into a systematic and structured protocol with inspiration from the gene–disease association criteria developed by the Clinical Genome Resource (ClinGen) group^[219]17,[220]68. Evidence in support of or against a candidate variant–participant match was evaluated by a team of clinical geneticists according to three categories for experimental evidence not taken into account by our statistical analyses: (i) model organism or cell line studies, (ii) tissue expression, and (iii) protein molecular function. Clinicians also took into account case-level data and published literature with case-control data including (iv) known disease associations, (v) gene evolutionary constraint, and (vi) variant pathogenicity. Discrepancies in opinion were mediated by joint discussion until a consensus decision was reached. A detailed description of the protocol and scoring scheme is available in Supplementary Note [221]2 and hierarchical decision trees to streamline the scoring process are provided in Supplementary Fig. [222]2. Clinical score calibration We ensured that the protocol was specific and detailed to the extent that different clinicians with access to the same patient data would independently assign equal clinical scores to the same candidates. Over the course of two months, at least two clinicians each evaluated 2–3 compound heterozygous candidates per week and independently recorded their notes, final clinical scores, and score rationale in a REDCap database^[223]84. At weekly joint discussions, they iteratively updated the protocol to improve specificity and reduce discrepancies in scoring. The final two joint discussions confirmed that categorical and final clinical scores assigned by different clinicians were consistently in agreement. Validation The clinical team was provided with ten “candidates” and ten “decoys” from real UDN patients in random order to evaluate. The team was blinded to gene labels, variant inheritance and SpliceAI score information during evaluation. “Candidate” genes had two rare variants (gnomAD popmax AF < 0.001) inherited in trans where one variant was exonic with CADD > 23 and the second variant was intronic with a max SpliceAI > 0.3. “Decoy” genes were selected with identical criteria except that variants were actually inherited in cis or the intronic variant had a maximum SpliceAI score of 0. After assigning final clinical scores to each of the 20 genes, the candidate/decoy labels were revealed to the clinical team (Supplementary Data [224]1). Identification of de novo variants For each of the 1463 sequenced trios in our harmonized UDN dataset, including trios with unaffected offspring, we select the subset of variants with read depth >10 and genotype quality (GQ) >20 across proband, mother and father. We further subset to variants with a “high” Roulette quality score, gnomAD population maximum allele frequency <0.01, TOPMed^[225]85 allele frequency < 0.01, proband alternate allele read depth >4 and frequency > 0.2, and alternate read depth <2 in both parents. We then utilize observed aligned reads across each trio and across thirty unrelated individuals to assign posterior probabilities to each putative de novo variant on autosomes using the CGAP reimplementation of novoCaller ([226]https://cgap.hms.harvard.edu/)^[227]16. We consider all de novos with a novoCaller posterior probability >0.7 to be high confidence, noting that thresholding the novoCaller posterior probability from 0.5 to 0.95 has negligible impact on the number of passing variants overall and per-proband (Supplementary Fig. [228]6a). We further exclude probands with over 150 high confidence de novo calls, as these patients frequently had “suspected parental mosaicism” mentioned in their clinical records. Finally, because clonal sperm mosaicism may lead to siblings inheriting identical de novo variants, we exclude duplicate de novo variants within each family from downstream recurrence analyses^[229]86. This process resulted in 1072 trios with an affected proband and unaffected parents for further analysis. Analytical test for de novo cohort-level recurrence Basic statistic definition We define a cohort as a set of N genomes (i.e., collections of genes) each with sets of de novo variants arising independently but based on the same background de novo mutation rate. Let [MATH: μi :MATH] denote the de novo mutation rate of a specific variant i. The mutational target of a gene g is [MATH: μg=ig μi. :MATH] 1 The mutational target of a variant [MATH: v :MATH] in gene g is [MATH: μg,v=ig μi1sc oreiscore v :MATH] 2 where [MATH: scorei :MATH] is the deleteriousness score of variant i. Intuitively, the more surprising and/or deleterious a variant, the smaller its mutational target. By definition, variant mutational targets are uniformly distributed from 0 to [MATH: μg :MATH] , so [MATH: μg,vμg~U[0,1] :MATH] 3 ppose there are K de novo variants falling within gene g across the cohort, where [MATH: K1 :MATH] . We define a statistic y as [MATH: y=i=1K1μg,viμ g :MATH] 4 Note that Y is a sum of K uniformly distributed variables on [0,1] under the null. The distribution of Y given parameter K can thus be modeled by the Irwin-Hall (IH) “sum of uniforms'' distribution, which has a closed form for its cumulative density function (CDF) and thus also for its survival function (SF), where [MATH: SF= 1CDF :MATH] ^[230]87. This enables us to replace permutation-based significance evaluations and instead analytically compute the probability of achieving a Y as high or higher than observed with K variants using the IH survival function as [MATH: Pr(yK)=IHSF(y< /mi>K) :MATH] . We note that there are many other constructions over a set of uniformly-distributed random variables (such as P values)^[231]88,[232]89. We further note that as the cohort size dramatically increases, the Irwin-Hall distribution can be replaced with the normal distribution. Finally, we also modelPr(K), the probability of K independent de novo variants to land in gene g given this cohort of size N, to assign an overall significance value to our statistic y as [MATH: Pr(y)=K=1Pr(yK) Pr(K) . :MATH] 5 Because neither y nor Pr(K) are defined for [MATH: K= 0 :MATH] , we do not expect Pr( y) to be uniformly distributed. Instead, only [MATH: Pr(yK1)=P r(y)/Pr(K1) :MATH] is expected to be uniformly distributed (Supplementary Fig. [233]9). In a single genome with [MATH: n :MATH] total observed de novo variants, the number of de novo variants to land in a particular gene g, given that [MATH: μg1 :MATH] , is Poisson distributed, parameterized by the expected number of de novos [MATH: λ=nμg< /mi> :MATH] . In a cohort of N genomes, the number of de novo variants to land in gene g is therefore a sum of N Poisson-distributed random variables, which itself is also Poisson distributed. We thus compute [MATH: Pr(K)=Pois( Kλ) :MATH] , where λ is given by [MATH: E[K]λ=μgj=1Nnj. :MATH] 6 Different deleteriousness scores for coding and intronic variants We use continuous, per-variant deleteriousness scores that are precomputed and publicly-available for all possible variants genome-wide in our computations. Precomputed scores are required for the calculation of comprehensive, basepair-resolution mutational targets as described above. For missense variants, we interchangeably use AlphaMissense (version hg38 released with their 2023 publication), PrimateAI-3D (academic license, accessed May 2024), CADD (version 1.6), and REVEL (accessed May 2024).23,24,30,31 CADD is also used for scoring all other exonic variants, including nonsense and indel variants. For intronic variants, we use SpliceAI (academic license, accessed May 2021).41 We use different variant functionality scores for exonic and intronic variants because we found that these values are poorly correlated with each other in intronic space (Supplementary Fig. [234]15). Clinical sequencing centers also regularly report these scores, suggesting their relevance in rare disorders^[235]68,[236]90. Different mutation rate models for SNV and indel variants We use Roulette de novo mutation rates for SNVs genome-wide. Different mutational processes lead to indel mutations, so Roulette values cannot necessarily be adapted to model this mutation type.32 We approximated per-gene joint distributions of indel mutation rates and deleteriousness scores as follows. First, we considered all possible exonic indels of length ≤10nt for which precomputed CADD scores were available for download and all possible intronic insertions of length 1nt and deletions of length ≤4nt for which precomputed SpliceAI scores were available for download. Although SpliceAI provides predictions exhaustively for all possible indels, CADD provides scores for the subset of indels observed in gnomAD-v2. We excluded all indels that overlapped with any SNVs assigned a Roulette “low quality'' filter, which are based on gnomAD quality metrics, abnormal density of segregating sites, and suspicious patterns of recurrence. We further excluded indels with a gnomAD popmax MAF >0.1% and/or a number of alleles in gnomAD (AN) in the bottom decile. For exonic and intronic variants separately, we binned all indels by their precomputed CADD or SpliceAI score rounded to the nearest hundredth. The total number of indels within a deleteriousness score bin and all bins corresponding to higher deleteriousness scores was used as an approximation to the mutational target associated with that score. Incorporation of different variant types Because there are different deleteriousness scores for coding and intronic variants and different mutational targets for SNV and indel variants, we expand our basic test statistic to accommodate different variant types [MATH: t :MATH] {coding SNV, coding indel, intronic SNV, intronic indel}. We redefine a gene and variant mutational target with respect to each variant type as [MATH: μg,t=ig,tμi :MATH] 7 and [MATH: μg,v,< /mo>t=ig,tμi1sc oreiscore v :MATH] 8 where g,t refers to the subset of all possible variants in gene g of type t. We define y' as [MATH: y=ti=1 K1μg,vi,tμg,t< /msub>~IHyt< mi>Kt :MATH] 9 where K[t] is the number of observed de novo mutations of variant type t landing in gene g. The expected number of de novos to land in gene g when considering different variant types is [MATH: E[K]λ=t< mi>μg,tj=1Nnj,t :MATH] 10 where [MATH: nj,t :MATH] denotes the total number of observed de novo variants of type t in an individual j. For each variant type t, we scale [MATH: μg,t :MATH] such that [MATH: g μg,t= 1 :MATH] . We compute [MATH: Pr(K)=Pois( Kλ) :MATH] . Cauchy-combination of P values computed with different deleteriousness predictors We can run our method using different deleteriousness score predictions for coding SNVs (i.e., AlphaMissense, PrimateAI-3D, CADD, or REVEL), resulting in slightly different lists of genes with corresponding P values when incorporating this variant type. We combine these lists using the Cauchy combination test, an analytic calculation that is applicable under arbitrary dependence structures^[237]89. Incorporation of GeneBayes values We incorporate GeneBayes values, which estimate the selection against heterozygous protein-truncating variants per gene, as weights in a weighted false discovery rate (FDR) procedure^[238]26,[239]91. We sort all genes in ascending order by their GeneBayes values. We then separate these sorted genes into 10 equally sized decile bins. For each gene g in each bin b[g], we compute a weight [MATH: wg :MATH] as [MATH: wg=10DDb[g]DD :MATH] 11 where DD is the set of exclusively dominant disease-causing genes as annotated in OMIM (accessed December 2023). Genes without GeneBayes values are assigned a weight [MATH: wg= 1 :MATH] . Note that [MATH: E[w g]=1 :MATH] and that GeneBayes values, which are constant for all variants within a given gene, are independent from y and y' values, which vary for variants within a gene based on variant mutational targets and deleteriousness scores. This enables us to perform Benjamini-Hochberg false discovery rate correction on weighted Q-values computed for each Pr(y') as [MATH: Q=Pr(y)/wg :MATH] .26 Massively parallel splicing reporter assay (MPSA) Assay design We designed oligonucleotides to evaluate the impact of a variant predicted to cause a cryptic splice site gain or a canonical splice site loss. For each variant with a predicted splice-altering impact, we extracted the surrounding genomic sequence from the UDN patient harboring the variant (alternate) as well as a paired version with the variant of interest replaced with the reference allele (reference). We centered the candidate sequence on the variant of interest, noting that the impacted splice site junction could be up to 50 nucleotides away from the variant. For a subset of variants, we also generated candidate sequences that were centered on the predicted site of the altered splice junction rather than on the variant itself. We embedded each candidate sequence in an oligonucleotide template containing a 4-nt barcode and flanking primers as follows: Splice donor library structure GCACGGACAAAGTACTAGCC [155-nt candidate sequence][4-nt SD-associated barcode] GGAAGATCGACGCAGgtaagt Splice acceptor library structure TGCTCTTATGCGAACGTGTTAAC [4-nt SA-associated barcode] [151-nt candidate sequence] GGAAGATCGACGCAGgtaagtt The final oligonucleotide library contained 6000 200-nt oligonucleotides, encompassing 1920 alternate/reference pairs, which we ordered from Twist Bioscience. Library cloning and experimental protocol The oligonucleotide library was cloned separately using PCR amplification and NEBuilder assembly into lentiviral splice acceptor (pLenti-MPSA-acceptor) and splice donor (pLenti-MPSA-donor) vectors. These vectors consisted of an EF1A promoter and an mCherry open reading frame (ORF) followed by splicing reporter modules based off of prior published massively parallel splicing reporter constructs^[240]92,[241]93 (Supplementary Fig. [242]7) as well as a separate Puromycin selection cassette. Plasmids have been deposited to Addgene under accession numbers 240805 (“plentiMPSA SAentry PuroR'') for splice acceptors and 240806 (“plentiMPSA SDentry PuroR'') for splice donors. Lentiviral particles for each library were produced and titrated. Each library was transduced at a multiplicity of infection (MOI) of 0.3 in three biological replicates into 6.25*106 cells/replicate of HepG2 (liver cell line derived from primary cells extracted from a white, male, 15-year-old with liver cancer, catalog #HB-8065) and SK-N-SH (neural-like cell line derived from primary cells extracted from a female 4-year-old, catalog #HTB-11), both acquired from American Type Culture Collection (ATCC). Cells were routinely tested for mycoplasma contamination via qPCR, and all tests were consistently negative for mycoplasma. Cells were selected with Puromycin to completion, and genomic DNA and RNA were harvested one week after transduction. PCR-based nextgen sequencing (NGS) library preparation was performed on all 12 genomic DNA and RNA samples. Libraries were sequenced with 75-nt paired-end reads using an Illumina NextSeq 500 sequencer, ensuring an average of >1000 reads per library member from all libraries. Barcode mapping Over ~75% of all RNA reads could be mapped back to a 15-nt barcode found in our starting dictionary. This resulted in ~6—15 million mapped RNA reads per MPSA replicate, yielding a median of 1170 mapped reads per alternate/reference library pair per replicate. Results from Tapestation, an automated electrophoresis system for sizing and quantifying nucleic acid samples, showed that 49.6% of mapped reads from splice donor MPSA experiments utilized some library splice donor site and 50.4% utilized the experimentally fixed site. Across splice acceptor MPSA experiments, 58.3% of mapped reads utilized some library splice acceptor site and 41.7% utilized the fixed site. MPSA validation rate We considered all alternate/reference library pairs with at least 10 barcode-disambiguated mapped reads each in one or more MPSA experiments; 99.4% of pairs met this requirement. Each read was then categorized as (1) using the experimentally fixed splice site, (2) using a splice site corresponding to a known intron/exon junction as annotated in Ensembl, (3) using the SpliceAI-predicted cryptic splice gain site, (4) using a cryptic splice site at a different location, (5) malformed where the read did not begin with the correct fixed sequence due to a next-generation sequencing error, or (6) recombined where the read did not align to the expected oligo sequence at all. The percent of malformed and recombined reads per alternate/reference pair was 7.5% (SD=1.9%) and 6.2% (SD=10.6%) respectively on average. The position of SpliceAI-predicted cryptic splice sites often did not correspond to the expected splice junction based on manual inspection or to the splice sites observed in MPSA experiments (55.4% of splice acceptor and 5.4% of splice donor predicted positions matched). We instead considered the most common cryptic splice site position observed in each alternate library sequence to be the predicted site. MPSA validation rate is computed per alternate/reference library pair as the difference in percentages of total reads supporting the predicted cryptic splice site between oligos containing the alternate variant and the corresponding reference oligonucleotides (Supplementary Fig. [243]8a). We compared the MPSA validation rates across the three biological replicates and two cell types using Pearson's correlation (Supplementary Fig. [244]8b). DeNovoWEST gene-specific enrichment of de novo variants We modified the DeNovoWEST weighted permutation test by first augmenting the set of variants under consideration beyond exonic variants to include all possible intronic variants in protein-coding genes with a SpliceAI score >0.4, resulting in ~400k additional possible variants under consideration.41 To this end, we modified the codebase (version v1.0.0) to consider these intronic putatively splice-altering variants to have the same functional consequence as canonical splice site variants if they had a VEP annotation of “splice_acceptor'' or “splice_donor'' or the same functional consequence as missense variants otherwise. We then updated the required precomputed values, including per-variant mutation rates, minor allele frequencies, deleteriousness scores and per-region constraint values as detailed below, for all exonic and intronic variants under consideration (Supplementary Fig. [245]9a). The underlying triplet-context mutational model was replaced with genome-wide, per-SNV Roulette mutation rate estimates^[246]94. Each variant's minor allele frequency was set to the maximum gnomAD-v3 population or TOPMed allele frequency. Per-variant Phred-scaled and unscaled CADD values were obtained from ([247]https://cadd.gs.washington.edu/) (version 1.6 for GRCh38/hg38). Updated per-gene shet values were obtained from ([248]http://genetics.bwh.harvard.edu/genescores/selection.html) and binned into a “low'' category if mean shet was below 0.15 and a “high'' category otherwise.27 Notably, some stable Ensembl gene IDs in GRCh37/hg19 are not present in GRCh38/hg38 and vice versa; all variants from the 894 GRCh38/hg38 genes without shet values are binned into the “low'' category. Regional missense constraint values, defined for adjacent windows covering the full genomic region of each protein-coding gene were obtained from ([249]https://gnomad.broadinstitute.org/downloads#exac-regional-missens e-constraint). We translated these genomic region coordinates from GRCh37/hg19 to GRCh38/hg38 using UCSC's LiftOver tool and then assigned a constraint value to exonic and intronic variants corresponding to the genomic region they fell into. We recomputed the weights assigned to each variant type using the union of all de novo variants in our cohort and the de novo variants released with DeNovoWEST (encompassing ~31,000 exome-only trios), because the distribution of de novo variant classes in UDN data was very similar to the distribution of de novo variant classes in the dataset used by DeNovoWEST (Supplementary Fig. [250]9b-c) and because the authors warn that weights generated from smaller datasets alone may be unreliable. Gene severity scores were then computed for every gene harboring one or more de novo variants across our cohort. We adjust DeNovoWEST assigned p-values using Bonferroni correction for twice the total number of genes evaluated as suggested by the authors. We find that DeNovoWEST and RaMeDiES-DN (using only CADD in exonic regions as a closer comparison to DeNovoWEST) recovered known autosomal dominant disease genes at a comparable rate across de novo variants provided in the original DeNovoWEST paper (Supplementary Fig. [251]16). Analytical test for compound heterozygous cohort-level recurrence A compound heterozygous configuration is an independent occurrence of two variants: one maternally (M) and the other paternally (D) inherited. Here we use the term “compound heterozygous configuration'' to refer to either two distinct heterozygous variants inherited from either parent, or a single homozygous recessive variant where both independently-inherited alleles arose from independent mutational events (Supplementary Note [252]4). The mutational target of a compound heterozygous configuration should therefore lie in a space of squared mutational targets. We define the mutational target of a compound heterozygous configuration as [MATH: μg, vM,v< /mi>D=ma x(μg,vM,μ g,vD)2 :MATH] 12 where [MATH: vM :MATH] and [MATH: vD :MATH] are maternally and paternally inherited variants comprising a compound heterozygous configuration, and [MATH: μg, vM :MATH] and [MATH: μg, vD :MATH] are computed as in Eq. [253]12. To prioritize compound heterozygous configurations with both deleterious variants, we use the maximum over per-variant mutational targets. A single deleterious variant in a compound heterozygous configuration may indicate carrier status rather than a compelling candidate for a rare disorder. By this definition, [MATH: μg, vM,v< /mi>D :MATH] is uniformly distributed at null (Supplementary Note [254]4). This enables us to define a similarly constructed statistic [MATH: yc :MATH] modelable by the Irwin-Hall distribution as in the case of recurrent de novos (Eq. [255]4): [MATH: yc=j=1 K1μ< /mrow>g,v M,i,vD,iμg 2~IHycK :MATH] 13 where K is the number of compound heterozygous configurations independently landing in gene g across the cohort, and [MATH: vM,j :MATH] and [MATH: vD,j :MATH] are the maternally and paternally inherited variants in gene g in individual j. As before, K is approximately Poisson distributed, and parameter [MATH: λc :MATH] , the expected number of compound heterozygous configurations to land in gene g, is given by [MATH: E[K]λc=μg2j=1NnM,jnD,j< /mrow> :MATH] 14 where [MATH: nM,j :MATH] and [MATH: nD,j :MATH] are the numbers of maternally and paternally inherited rare variants in an individual j, respectively. We compute [MATH: Pr(K)=Pois( Kλc ) :MATH] as before. Finally, we extend this basic test statistic to accommodate 16 compound heterozygous configuration types as [MATH: (tM,tD) :MATH] {coding SNV, coding indel, intronic SNV, intronic indel}2 and define [MATH: yc< mo>′ :MATH] and Poisson parameter [MATH: λc< mo>′ :MATH] accordingly as [MATH: yc< mo>′=tM,t Di=1K< mrow>tM,tD1μg,vM,i,tM,vD,i,tD< /mrow>max(μg,tM,μg,< /mo>tD< /msub>)2 :MATH] 15 and [MATH: E[K]λ< mrow>c= tM,t Dμg,tM< /msub>μg,< /mo>tD< /msub>j=1NnM,tM,jnD,< mi>tD,j :MATH] 16 where [MATH: Kt< mi>M,tD :MATH] is the number of compound heterozygous configurations in gene g across the cohort where the maternally inherited variant is of type [MATH: tM :MATH] and the paternally inherited variant is of type [MATH: tD :MATH] . Instances where [MATH: Kt< mi>M,tD= 0 :MATH] are excluded from the above sums. For each variant type t, we scale [MATH: μg,t :MATH] such that [MATH: g μg,t= 1 :MATH] . We compute the probability of [MATH: yc< mo>′ :MATH] as in Eq. [256]5. Note that inherited homozygous recessive variants that arose from a single, ancestral mutational event violate the assumptions of our approach and are excluded (Supplementary Note [257]4). Modeling false positive diagnoses For any gene where the observed number of variants [MATH: K>E[K] :MATH] across the cohort, we suspect that there are some true diagnoses in specific patients as well as some “false positives'' where a randomly occurring variant in a patient is unrelated to the patient's condition. We use the binomial distribution parameterized by K independent trials and probability of success per trial [MATH: E[K]/N :MATH] to estimate the proportion of false positive diagnoses for each gene. Analytical test for individual-level compound heterozygous configuration Given a set of independent compound heterozygous configurations across genes in a single individual's genome, we construct a test for the hypothesis of a monogenic, recessive disorder caused by one of these compound heterozygous configurations against the null. We assume up to one compound heterozygous configuration per gene, i.e., for each gene g, [MATH: nM nD μg2< /mrow>1 :MATH] , where [MATH: g μg= 1 :MATH] and [MATH: nM :MATH] and [MATH: nD :MATH] are the numbers of maternally and paternally inherited rare variants in this individual's genome. We now rescale the mutational target of a compound heterozygous configuration (Eq. [258]12) with respect to all genes g in the genome as [MATH: μ~g,vM< /mi>,vD=iG min(μ i2, μg, vM,v< /mi>D)iG μi 2. :MATH] 17 Intuitively, this corresponds to the probability of observing a compound heterozygous configuration with an equal or smaller (i.e., more surprising) mutational target occurring in any gene in the genome. Thus, [MATH: μ~~U[0,1] :MATH] . We precompute each gene's compound heterozygous mutational target [MATH: μi 2 :MATH] for all genes in the genome in order to quickly compute [MATH: μ~ :MATH] values for each observed compound heterozygous configuration in an individual. Next, we define our statistic [MATH: y~c :MATH] per individual as the minimal observed rescaled compound heterozygous mutational target: [MATH: y~c=min(μ~1,...,μ~K) :MATH] 18 We compute the probability of observing a [MATH: y~c :MATH] value this low or lower given K total genes with observed compound heterozygous configurations in an individual's genome as [MATH: Pr(y~cK) =1i=1KPr(Y> y~c)=1Pr( Y>y~c) K. :MATH] 19 where Y is a dummy variable. Because [MATH: y~c :MATH] is uniformly distributed on [0,1], [MATH: Pr(Y>y~c)=(1y~c) :MATH] , so we simplify this calculation as [MATH: Pr(y~cK) =1( 1y~c) K :MATH] 20 We also model the distribution of K observed compound heterozygous configurations across an individual's genome in order to compute the overall probability of our statistic [MATH: y~c :MATH] using the same formulation as before (Eq. [259]5). The distribution of K, given our prior assumption of at most one compound heterozygous configuration per gene, has an exact solution as the number of double events in a bivariate binomial distribution with correlation parameter [MATH: ρ :MATH] capturing the effect of different gene lengths on K. However, due to the complexity in calculations of the exact solution, here we use the Poisson approximation instead because, for each gene g, [MATH: g μg= 1 :MATH] and [MATH: nM nD μg2< /mrow>1 :MATH] . The [MATH: λ~c :MATH] parameter for the Poisson approximation in this case is [MATH: E[K]λ~c=nMn Di=1 Gμi2 :MATH] 21 Finally, we accommodate the 16 compound heterozygous configuration types as [MATH: (tM,tD) :MATH] {coding SNV, coding indel, intronic SNV, intronic indel}2 and redefine [MATH: μ~ :MATH] , [MATH: y~c :MATH] and Poisson parameter [MATH: λ~c :MATH] accordingly as [MATH: μ~g,vM< /mi>,vD=tM,t DiG min(μi,tM< /mi>μ i,tD,μg,vM,tM,vD,tD)tM,t DiG μi,tM< mrow>μi,t< /mrow>D :MATH] 22 and [MATH: y~c< mo>=min(μ~1,. ..,μ~K) :MATH] 23 and [MATH: E[K]λ~c=tM,t DnM,tM< /msub>nD,< /mo>tD< /msub>iG μi,tM< mrow>μi,t< /mrow>D :MATH] 24 Enrichment for correct diagnoses Given a ranked list of genes across a cohort of patients, where each gene may be diagnostic for the given patient, we can compute enrichment for correct diagnoses at each gene rank. We use Fisher's exact test to compare the proportion of complete, certain diagnoses in all genes up to and including rank k compared to the proportion of correct diagnoses at genes ranked k+1 through the end of the list. We consider the minimum Fisher's exact test P across all k to be our overall enrichment. We assign a permutation-based P value to this enrichment value by randomly permuting the initial gene list 10,000 times and recomputing the minimum Fisher's exact test P for each permuted list. Transcriptome sequencing analysis for MED11 RNA extraction, sequencing and quality control RNA was extracted from UDN patients' whole blood samples received at UCLA between 2018 and 2019 using PAXgene Blood RNA extraction kits from Qiagen. Concentration of RNA in each sample was quantified using the Qubit 3.0 Fluorometer. RNA integrity numbers (RINs), a quality control measure, were assessed per sample using the Agilent bioanalyzer. RNA libraries were prepared for each sample using either the NuGEN Universal Plus mRNA-Seq kit or the Illumina TruSeq mRNA + Globin Minus kit. Sequencing was then performed on the Illumina NovaSeq 6000 to generate ~50-100 million 100-150bp paired-end reads per sample. Library preparation and sequencing were performed at the UCLA Neuroscience Genomics Core and the UCLA Technology Center for Genomics and Bioinformatics Core. Sequenced reads in FASTQ format were aligned to human reference genome GRCh37 using STAR v2.5.2b with default parameters and Gencode v19 annotations^[260]95,[261]96. To increase sensitivity to novel splice junctions, reads were mapped using the STAR 2-pass mode, where novel splice junctions detected during the first pass alignment are indexed and used alongside known splice junctions in the second pass remapping. We confirmed effective ribosomal RNA (rRNA) depletion per sample by aligning all paired-end reads to the complete sequences for nuclear and mitochondrial rRNAs using BWA-mem v0.7.17^[262]97 and ensured that the proportion of aligned reads did not suggest excessive rRNA contamination. Duplicate reads were marked using PicardTools v4.2.4.0 and post-alignment sequencing quality was assessed using RNA-SeQC v1.1.8 to ensure adequate library complexity^[263]98,[264]99. RNA sample identity was confirmed by comparing single nucleotide variant (SNV) calls from RNA sequencing to SNV calls from corresponding exome or genome sequencing data per sample. Intron retention outlier analysis Fifty-three tissue-matched control samples from UDN participants unrelated to the proband, mother and father were selected for outlier analysis. IRFinder v1.2.456 was run in BAM mode using the same human reference GRCh37 and Gencode v19 annotations on aligned BAM files to measure the level of intron retention (i.e., “IRratio'') in MED11 across the proband, mother, father, and control samples. The IRratio is computed per sample as (median read depth of first intron) / (number of reads spanning first and second exons + median read depth of first intron). Aligned reads covering the MED11 gene region (i.e., chr17:4,634,723-4,636,903) from the proband, mother, father and two control samples were viewed using a local installation of the Integrative Genomics Viewer (IGV) v2.16.0^[265]100. Pathway enrichment analysis Phenotypically-similar patient groupings Phrank (version v2018-12-13) was used to compute all-against-all pairwise phenotype similarity scores between all affected patients' sets of standardized HPO terms. We normalized these scores by dividing by the maximum self-similarity score in each pair^[266]101. UDN patients experience a spectrum of symptoms across overlapping biological categories and therefore cannot be easily separated into distinct, well-defined clusters (Supplementary Fig. [267]17). We iteratively grouped similar patient pairs using complete-linkage hierarchical clustering with the agnes function from R's cluster package (version v2.1.4), which allows for patient groups of different sizes while minimizing the maximum distance between any two patients in the same cluster. We assigned patients to clusters by cutting the resulting dendrogram at height=3.5, resulting in 120 clusters of 2--97 patients per cluster (mean=22, median=17). Selecting genes per patient cluster We identify genes per patient cluster as follows. First, we consider known diagnoses for all UDN patients in that cluster. For patients with a diagnosis that was “complete'' (i.e., explained all symptoms including asserted phenotypes), no further genes are considered. For patients with no diagnosis or at most one “partial'' diagnosis, we then consider genes with an exonic (or intronic with a SpliceAI score >0.4) de novo variant and assign each gene its variant's severity weight (s) from our modified DeNovoWEST procedure. Recall that weights are assigned per variant class based on functional impact (e.g., frameshift, nonsense, missense), variant deleteriousness, and gene constraint. Autosomal de novos are considered as before in addition to de novos on chromosome X with gnomAD population maximum allele frequency < 0.0001, TOPMed allele frequency <0.0001, proband alternate allele read depth >20 and frequency >0.2 (for females) and alternate read depth >20 in both parents. We consider the most significant gene per patient with [MATH: Pr(s)<0.0005 :MATH] , where [MATH: Pr(s)=Pr(SsK=1)Pr(K= 1) :MATH] . [MATH: Pr(SsK=1) :MATH] is computed exactly using precomputed per-variant Roulette mutation rates and variant weights per gene. Pr(K=1) is computed assuming mutations follow a Poisson distribution with [MATH: λ=μge< /mi>ne :MATH] for genes falling on chromosome X in males and [MATH: λ= 2μgen e :MATH] otherwise. Finally, in patients who still have fewer than two genes at this point and no complete diagnoses, we include up to one additional gene harboring a compound heterozygous variant pair that ranked in the top 100 in our RaMeDiES-IND cohort-wide per-individual analysis, as there was significant enrichment for correct diagnoses in this set (Fisher's exact test P = 5.23e-4). Across all patient clusters, we considered 70 genes (6.15%) with de novo variants corresponding to known diagnoses, 10 genes (0.88%) with compound heterozygous variants corresponding to known diagnoses, 562 (49.38%) other known diagnostic genes, 434 genes (38.14%) with new de novo candidates, and 62 genes (5.45%) with new compound heterozygous candidates. Gene set enrichment analysis (GSEA) The genes found across all patients in each patient cluster were used as a query set for gene set enrichment analysis (GSEA) using g:Profiler (version e108_eg55_p17)^[268]102. We considered Reactome and KEGG biological pathway gene sets of size <150 genes and set our background gene set to all human genes annotated in Ensembl. Enrichment P values are adjusted using g:Profiler’s g:SCS approach^[269]103. Briefly, for every query gene set size, 2000 random gene sets of the same size are used as queries for GSEA with the same parameters, and the lowest pathway enrichment P value is recorded for each random query set. A threshold t is selected for each query gene set size as the 5% quantile of these random minimum P values. Enrichment P values resulting from the true gene query are then adjusted by multiplying by 0.05/t. Reporting summary Further information on research design is available in the [270]Nature Portfolio Reporting Summary linked to this article. Supplementary information [271]Supplementary Information^ (8.3MB, pdf) [272]41467_2025_61712_MOESM2_ESM.pdf^ (126.6KB, pdf) Description of Additional Supplementary Files [273]Supplementary Data 1-9^ (3.7MB, xlsx) [274]Reporting Summary^ (3.4MB, pdf) [275]Transparent Peer Review file^ (25.5KB, pdf) Acknowledgements