Abstract
Genomics for rare disease diagnosis has advanced at a rapid pace due to
our ability to perform in-depth analyses on individual patients with
ultra-rare diseases. The increasing sizes of ultra-rare disease cohorts
internationally newly enables cohort-wide analyses for new discoveries,
but well-calibrated statistical genetics approaches for jointly
analyzing these patients are still under development. The Undiagnosed
Diseases Network (UDN) brings multiple clinical, research and
experimental centers under the same umbrella across the United States
to facilitate and scale case-based diagnostic analyses. Here, we
present the first joint analysis of whole genome sequencing data of UDN
patients across the network. We introduce new, well-calibrated
statistical methods for prioritizing disease genes with de novo
recurrence and compound heterozygosity. We also detect pathways
enriched with candidate and known diagnostic genes. Our computational
analysis, coupled with a systematic clinical review, recapitulated
known diagnoses and revealed new disease associations. We further
release a software package, RaMeDiES, enabling automated cross-analysis
of deidentified sequenced cohorts for new diagnostic and research
discoveries. Gene-level findings and variant-level information across
the cohort are available in a public-facing browser
([59]https://dbmi-bgm.github.io/udn-browser/). These results show that
case-level diagnostic efforts should be supplemented by a joint genomic
analysis across cohorts.
Subject terms: Statistical methods, Diseases, Data integration
__________________________________________________________________
Using well-calibrated statistical methods the authors jointly analyze
Undiagnosed Diseases Network genomes, identifying known and novel
disease genes. Software is publicly available to support future
cross-cohort rare disease discovery efforts.
Introduction
For decades preceding the widespread application of DNA sequencing,
identifying the genetic etiology of rare monogenic phenotypes including
human diseases relied on segregation in pedigrees^[60]1. DNA sequencing
enabled the analysis of sporadic cases with no segregation data^[61]2.
Early studies analyzed small cohorts of phenotypically similar
cases^[62]3,[63]4, a highly successful approach that is, however,
limited to diseases with multiple known patients with fairly
homogeneous presentations. In the absence of such phenotypically
matched case cohorts, per-case studies of individual, undiagnosed
patients are gaining popularity^[64]5–[65]8. By design, these studies
cannot attain statistical power from the shared genotypes of unrelated
patients and require extensive clinical and biological inquiry to prove
the causal involvement of the genotype in disease^[66]9–[67]11. The
most recent phase of human Mendelian genetics employs a data science
approach to gene discovery propelled by the joint genomic analysis of
phenotypically broad cohorts. Recent studies by the Deciphering
Developmental Disorders and 100,000 Genomes consortia have demonstrated
the power of this approach to identify new diagnoses and disease
genes^[68]12–[69]14. This opens the prospect of international
cross-cohort analyses, leveraging parallel efforts in many countries,
and appreciating that rare diseases know no borders.
Undiagnosed diseases network dataset
Here, we apply existing and newly developed statistical genetics
methods to the Undiagnosed Diseases Network (UDN) cohort that includes
extremely difficult-to-solve, likely genetic cases (Fig. [70]1a–e). The
unique, diagnostically elusive presentation is the only criterion for
inclusion, and patients have varied presentations including
neurological, musculoskeletal, immune, endocrine, cardiac, and other
disorders. Symptom onset ranges from neonatal through late adulthood.
In contrast to most existing rare disease cohorts, individuals accepted
to the UDN have already undergone lengthy but ultimately unfruitful
diagnostic odysseys prior to enrollment. These patients subsequently
undergo extensive phenotypic characterization at UDN clinical
sites^[71]15. Both broad Human Phenotype Ontology (HPO) terms and
highly detailed clinical notes are collected and made available for all
UDN researchers. Phenotypic information includes laboratory
evaluations, dysmorphology examinations, specialist assessments,
surgical records, and imaging (Fig. [72]1f).
Fig. 1. Undiagnosed diseases network cohort analysis.
[73]Fig. 1
[74]Open in a new tab
a Map of clinical and research sites within the Undiagnosed Diseases
Network (UDN) for evaluating patients and candidate variant
functionality. Map created using R’s usmap package. b Genetic ancestry
across the sequenced patient cohort. c Clinician-recorded primary
symptom categories of patients. “Multiple” indicates 2+ categories
could be considered primary and “other” indicates an unlisted category.
Categories marked with an asterisk (*) are neurological subtypes
(Supplementary Note [75]1). d Patient-reported age of first symptom
onset. e Patient sex. f Categories and quantity of phenotype
information collected for patients and made available to all UDN
researchers. g Intronic variants detectable from genome sequencing
(orange star) with a predicted splice-altering impact are considered
alongside exonic variants in our statistical framework; these variants
may result in retained introns or excised exons in processed
transcripts. h We consider genes and gene pathways harboring de novo
and compound heterozygous variants in sequenced trios (72% of all
accepted cases). Other inheritance modes (e.g., homozygous, uniparental
disomy) are not considered in our cohort-based framework. Complete case
count by family structure (e.g., proband-only, duo) is in Supplementary
Fig. [76]2. i Depiction of clinical framework to uniformly evaluate how
well a patient’s phenotypes are concordant with a candidate gene or
variant. All icons in (f) and (i) are from Microsoft PowerPoint.
There is a similar emphasis on collecting sequencing data, with whole
genomes sequenced for probands and their immediate or otherwise
relevant family members. Although smaller than some other rare disease
cohorts^[77]14, the UDN—with a design bridging clinical, research and
functional validation teams and a focus on extreme patient
presentations—was thought to be optimized for case-based analyses where
probands are evaluated individually in depth. Patients’ detailed
phenotypic information, ongoing confirmation of new diagnoses, and the
potential enrichment for novel genetic disorders make for an ideal data
space to validate and develop statistical approaches. We harmonized and
jointly called single nucleotide (SNV) and insertion/deletion (indel)
variants across 4,236 individuals with whole genome sequencing in the
UDN dataset and additionally called de novo mutations from aligned
reads across complete trios (Methods, Supplementary Fig. [78]1)^[79]16.
Clinical evaluation of computational findings
Here, we generate candidate gene–patient matches via a series of
statistical genomic analyses implemented in our software suite, Rare
Mendelian Disease Enrichment Statistics (RaMeDiES, Fig. [80]1g, h). We
focus on the model of monogenic, autosomal inheritance in de novo and
compound heterozygous cases to prioritize candidates via a
genotype-first approach, with no clinical input or phenotypic
information used. Each candidate is then evaluated with respect to the
patient’s clinical presentation and the gene and variant’s putative
role in disease—based on known disease associations, functionality in
model organisms, tissue expression, molecular function, evolutionary
constraint, and in silico predicted pathogenicity—to assess phenotypic
match (Fig. [81]1i). For genes or gene pathways harboring deleterious
variants across multiple individuals, phenotypic similarity between
patients is also assessed. To scale clinical evaluation to the cohort
level, we developed a semi-quantitative protocol guided by the ClinGen
framework^[82]17 that uses hierarchical decision models to increase
efficiency and enables consistent and comparable evaluations of a
gene–patient diagnostic fit by independent experts (Supplementary
Note [83]2, Supplementary Fig. [84]3). We calibrated the protocol
during development by testing whether the resulting clinical scores
assigned by different experts on the clinical team were in agreement.
We validated the protocol in a blind test using non-causative candidate
genes as controls. Specifically, non-causative genes were selected with
identical criteria to true candidate genes, except biallelic variants
were in cis rather than in trans or had low predicted pathogenicity
scores. The clinical team applying the protocol consistently scored
true candidate genes higher than control genes (Wilcoxon one-sided
rank-sum P value = 0.0171, Methods, Supplementary Data [85]1),
suggesting that the scores generated by the clinicians’ protocol may be
used to prioritize candidates.
Results
De novo analysis
Several highly penetrant, extreme phenotypic presentations underlying
Mendelian and other congenital, complex human diseases have been linked
to de novo mutations^[86]12,[87]18,[88]19. We began by evaluating all
independent, sporadic trios with complete sequencing data for de novo
mutation etiologies. We detected 78.3 de novo point mutations and 9.5
de novo indels on average per proband genome concordant with the
expectation^[89]20. Mutation count showed expected dependency on
parental ages with Poisson-distributed adjusted counts, attesting to
the quality of de novo calling (Fig. [90]2a, and Supplementary
Data [91]2, Supplementary Fig. [92]4).
Fig. 2. De novo recurrence.
[93]Fig. 2
[94]Open in a new tab
a De novo mutation counts per proband adjusted for parental ages. Blue
vertical lines show the mean values of the distributions, and curves
represent the Poisson fits. b Schematic of analytical test for the
recurrence of de novos that considers distal splice-altering and exonic
SNV and indel variants, their variant functionality scores, a
genome-wide mutation rate model Roulette, and per-gene GeneBayes
constraint values. “Like” variants refer to those of the same variant
class (i.e., coding SNVs [CS], coding indels [CI], intronic SNVs [IS],
intronic indels [II]) and within the same functionality score and minor
allele frequency thresholds. c Genes with highest significance values
for de novo recurrence across the cohort, computed as described in b3
(Pr’( y[g] )) and b4 (Q[g]), when focusing on missense variants with
AlphaMissense and PrimateAI-3D scores; patients are represented as
colored circles. Complete gene list with exact P values can be found in
Supplementary Data [95]3. Note that multiple testing corrections have
been applied in the form of both Bonferroni (**) and FDR (*)
thresholds. d AlphaFold-predicted human LRRC7 protein structure
(AF-[96]Q96NW7-F1) covering the leucine-rich repeat region with high
predicted structural confidence (amino acid positions 86-463). The
fifth and eighth LRR domains where missense de novos were found are
highlighted in blue. Reference alleles for missense de novo variants
observed across two UDN patients (red) are shown in circles. A
depiction of LRRC7’s linear protein sequence (Ensembl ID
ENSP00000498937) with InterPro predicted domains shown in colored boxes
is below. e Overlap of phenotype terms annotated to each patient.
We then sought to identify genes enriched for deleterious de novo
mutations across our patient cohort. Although the UDN dataset contains
sequenced sporadic cases and their unaffected parents, classic
case-control associated burden tests are underpowered in the de novo
regime because of the extreme rarity of these variants, difficulty in
their identification without complete sequenced trios, and noise in
variant sampling from both cases and controls^[97]21. Instead, de novo
goodness-of-fit enrichment tests, which evaluate whether observed de
novo events exceed what would be expected by chance based on mutation
rates, are better suited for studying diseases with ultra-rare
pathogenic variants^[98]18,[99]19. The power of this de novo enrichment
calculation increases with better models of underlying mutation rates
and estimates of variant deleteriousness. Recently, the rate of de novo
emergence has been estimated at basepair resolution with a high degree
of accuracy^[100]22. Newly developed deep learning models for
predicting the pathogenicity of de novo and other variants also now
exhibit unprecedented accuracy in distinguishing disease-relevant
variants^[101]23,[102]24. We leverage these recent advances to build an
accurate, unbiased statistical procedure called RaMeDiES-DN to detect
genes enriched for deleterious de novos.
Unlike the earliest generation of de novo recurrence approaches which
leveraged Poisson approximations for runtime efficiency but could not
take advantage of improved deleteriousness scores and mutation rate
models^[103]18, RaMeDiES methods seamlessly incorporate per-variant
deleteriousness scores and mutation rates without sacrificing runtime.
Briefly, for a given observed variant in a gene, we define its
“mutational target” as the sum of per-variant de novo mutation rates
for all possible variants with as high or higher a deleteriousness
score. By construction, this per-variant mutational target is expected
to be a uniformly distributed statistic, enabling us to leverage
closed-form analytical formulas for computing their cumulative
probabilities (Supplementary Note [104]3). Our framework naturally
combines different variant types including SNVs and indels with a
distinct mutation rate model, and can interchangeably utilize various
deleteriousness scores (Fig. [105]2b, Methods). Although current
state-of-the-art de novo recurrence approaches also incorporate
relevant variant-level information, they rely on a complex, permutation
procedure^[106]12. RaMeDiES’ analytical approach eliminates the need
for permutation-based significance calculations and can process large
datasets in mere seconds while maintaining well-calibrated P values
(Supplementary Fig. [107]5).
We first focus on the subset of missense variants, which comprise a
sizable proportion of known Mendelian disease-causing variants and for
which new, specialized pathogenicity predictions exist (e.g.,
PrimateAI-3D and AlphaMissense)^[108]23–[109]25. We find one
significant gene, KIF21A, corresponding to the correct, complete
diagnosis in one patient and a strong partial diagnosis in one other
(Bonferroni-adjusted Cauchy-combined P value < 0.05, Fig. [110]2c).
Notably, disease genes with a de novo mode of inheritance are expected
to be under strong selection against heterozygous loss-of-function
variants. We further refine our method to incorporate this intuition by
prioritizing genes by their GeneBayes values, which indicate selection
against heterozygous protein-truncating variants, using a weighted
false discovery rate (FDR) procedure^[111]26–[112]28. With this
correction, we obtain three gene findings at an equivalent significance
threshold (Q-value < 3e-6) and eight gene findings at FDR 5%
(Supplementary Data [113]3). Our second and third gene hits, BAP1 and
RHOA, correspond to a known correct diagnosis in one patient and strong
clinical matches in two other patients. Among the five remaining genes
at FDR 5%, three genes (CACNA1C, COL4A1 and NOTCH1) correspond to known
diagnoses in five patients and the top clinical candidate in one
patient. Two impacted patients with de novo missense variants in the
leucine-rich repeat region of LRRC7, a gene expressed in the brain, had
phenotypic overlap of hypotonia and developmental delay; one patient
additionally experienced nystagmus, staring spells, and balance
problems and the second had ataxic gait (Fig. [114]2d–e). We discussed
these findings with the managing clinical teams overseeing these cases,
and these UDN patients have now been included in a clinical case series
linking LRRC7 to an emerging neurodevelopmental
disorder^[115]13,[116]29. Another gene, NRBP1, remains a strong
candidate in two patients due to their neurological phenotype overlap
and NRBP1’s expression in the brain. An initial functional study in fly
through the UDN Model Organisms Screening Core was inconclusive. This
gene has been submitted to MatchMaker Exchange.
We next consider all exonic variants, including nonsense variants and
indels, and further incorporate additional well-established
deleteriousness predictors, CADD and REVEL^[117]30,[118]31. Different
mutagenesis processes lead to indel mutations, so SNV mutation rate
models can be inappropriate for modeling this mutation type for some
genes^[119]32. We therefore constructed a separate per-gene mutation
rate approximation for indels (see Methods for details). When we reran
RaMeDiES-DN on all exonic variants using four deleteriousness
predictors, we additionally identified KMT2B (Bonferroni-adjusted
Cauchy-combined p-value < 0.05), corresponding to a correct diagnosis
in four patients due to de novo indel variants (Supplementary
Data [120]4, Supplementary Fig. [121]6a). The next seven gene findings
at FDR 5% had already been identified when assessing recurrence of
missense variants. At FDR 10%, we identify five new putative diagnoses.
For instance, two patients had high impact missense de novo variants
impacting H4C5, a histone gene that was not detected with significance
in our missense-only enrichment test due to its lack of precomputed
AlphaMissense scores. Both patients had infantile-onset gross motor
developmental delays, dysmorphic facial features, and speech
difficulties (Supplementary Fig. [122]6b,c). These and other phenotypes
exhibited by each patient were recently found to be linked to missense
variants in histone H4 genes^[123]33. For one of the patients, the de
novo variant was contemporaneously interpreted by UDN clinical experts
to be causal^[124]34. The second patient’s de novo variant has now been
reclassified as “pathogenic” and resulted in a new diagnosis for this
participant. Two other patients with sporadic neurodevelopmental delay
each harbor truncating de novo variants in ZNF865. Both patients have
phenotypic overlap with a series of 10+ other patients with ZNF865
mutations, which makes a compelling case for pathogenicity^[125]35.
Subsequent to the publication of the case series, we anticipate this
gene–disease relationship will be established as causal and both
variants to be reclassified as likely pathogenic.
Inclusion of deep intronic splice variants
Next, we demonstrate how RaMeDiES-DN can be extended to additionally
consider non-exonic variants uncovered uniquely from whole genome
sequencing using the same methodological infrastructure. On the one
hand, it remains challenging to identify non-coding regulatory variants
involved in rare Mendelian diseases^[126]36, and the overall role of
such variants in congenital disorders is still a subject of
debate^[127]37. On the other hand, distal gain-of-splice site mutations
creating new acceptor or donor splicing sites deep in the intronic
sequences of genes are now a well-recognized cause of monogenic
disease^[128]38. Identification of splice-altering variants directly
from genome sequencing data is recently possible using newly-developed
in silico predictive scores without relying on RNA sequencing. RNA
sequencing has limitations for diagnosis because it depends on the
availability of relevant tissue material that is especially challenging
to obtain for neurodevelopmental patients, and it may miss
lowly-expressed isoforms and those targeted by nonsense mediated
decay^[129]39. Moreover, identifying disease-causal intronic splice
variants is especially appealing due to their potential targetability
using antisense oligonucleotide therapies^[130]40.
Unlike functional predictions for exonic variants, which have been
extensively validated for consistency and accuracy via decades of
experimental in vitro and in vivo studies, functional predictions of
splice-altering intronic variants are relatively new and still require
experimental confirmation. We used a combined
computational–experimental approach to prioritize distal splice
variants using in silico predicted scores and an in vitro massively
parallel splicing reporter assay (Methods, Supplementary
Fig. [131]7)^[132]41,[133]42. We found the per-variant in silico
predictions to be mostly concordant with the in vitro assay readouts.
Variants assigned higher in silico scores are more frequently supported
by the experimental, in vitro assay, and those with relatively lower in
silico scores (SpliceAI <0.5) have a non-negligible validation rate as
well (Supplementary Fig. [134]8). This prompted us to incorporate the
full range of continuous SpliceAI scores, disregarding only the lowest
scoring variants, in our statistics. We found this approach to consider
distal splice-site variants attractive because it lends itself to a
statistical analysis alongside exonic variants. Once genome-wide
functionality score tracks are released for the next generation of
splice predictors as well (e.g., Pangolin)^[135]43, they can be
integrated into RaMeDiES using the same methodology leveraged for
exonic variant predictors.
No new candidate genes with a significant recurrence of intronic de
novos were found in the UDN dataset. However, by seamlessly
incorporating non-exonic variants within the same statistical test, our
approach enables a more complete, automated analysis of the growing
volume of whole genome sequencing data across rare disease consortia.
We also ran the state-of-the-art de novo enrichment approach,
DeNovoWEST^[136]12. Unlike our approach, DeNovoWEST incorporates a
gain-of-function model alongside a loss-of-function model, which has
the potential to yield additional findings. We equipped the DeNovoWEST
algorithm with the Roulette mutation rate model, up-to-date CADD
variant deleteriousness and s[het] gene constraint scores^[137]27, and
further incorporated deep intronic variants with predicted
splice-altering impact (Supplementary Fig. [138]9). This approach
yielded two Bonferroni-significant genes when applied to the UDN
dataset, one of which was also uncovered by RaMeDiES-DN at Bonferroni
significance and the second at a FDR of 6% (KMT2B and H4C5,
Supplementary Fig. [139]10). We did not apply an FDR-based approach to
DeNovoWEST’s results to consider additional gene findings, because
DeNovoWEST P values are a construct over three sometimes dependent
tests, rendering an FDR adjustment inappropriate. We also find CSMD1, a
highly indel-prone gene, within DeNovoWEST’s top-ranked five genes,
likely because indels and SNVs are not distinguished in the mutation
rate model^[140]44. We then ran DeNovoWEST and RaMeDiES-DN on a larger
cohort of previously published and jointly-analyzed de novo variants
from the Deciphering Developmental Diseases (DDD) study, GeneDx, and
Radboud University Medical Center (RUMC). Both methods recovered known
autosomal dominant disease genes at a comparable rate (Supplementary
Fig. [141]16, and Supplementary Data [142]5), but RaMeDiES-DN required
less than one-tenth the runtime.
As opposed to methods like DeNovoWEST that require individual-level
data, RaMeDiES’ operation at the level of mutational targets enables
sharing of intermediate statistics across cohorts without revealing
patients’ individual variants. The rare disease field has accumulated
many individually small, but collectively large, cohorts that have
never been included in any joint analyses because data governance
restricts external sharing of variant- and patient-level data. As a
proof of concept, we ran RaMeDiES-DN on the individual DDD, GeneDx, and
RUMC de novo cohorts and subsequently combined the three summary-level
mutational target statistics to generate gene findings. This
meta-analysis run resulted in the same per-gene significance values as
compared to RaMeDiES-DN’s run on individual variants from the combined
cohort (Supplementary Data [143]6).
Compound heterozygous variant analysis
We next evaluate compound heterozygous (comphet) variants, which are
the most likely cause of rare recessive disorders in populations with
low degrees of consanguinity, as is largely the case in the United
States^[144]45. Comphet variants are defined as a pair of alleles
landing within the same gene and inherited in trans from unaffected
parents who are also heterozygous at these loci. These inherited
disease-causing variants tend to be rare in the population, due to the
effect of selection against biallelic variant occurrences or against
slightly deleterious phenotypes of heterozygous variants^[145]46.
Despite the expected low frequency of individual alleles comprising a
comphet pair, directly selecting for highly deleterious comphet
variants still results in numerous false positive findings at the
cohort level, motivating a statistical approach for cohort-level
comphet prioritization. Developing a statistical framework analogous to
de novo goodness-of-fit enrichment requires modeling the distribution
of rare inherited alleles per individual. De novo mutations arise
through the universal process of mutagenesis and are therefore
straightforward to model. Similarly, the distribution of the total
number of all derived alleles per haploid genome (i.e., all
non-ancestral variants inherited from one parent without any imposed
frequency constraints) are also not dependent on the demographic
history of the population and therefore are straightforward to
model^[146]47,[147]48. In contrast, however, the distribution of the
total number of rare alleles per individual is highly dependent on
population structure, which is notoriously difficult to account for.
Some previous approaches for determining cohort-level significance of
comphet variants ignore population structure when modeling the number
of rare variants. Although this may be an accurate statistical test in
controlled model organism cross experiments, it is inappropriate for
natural human populations, where population structure is present even
at a very fine scale^[148]49. In the Genome of the Netherlands (GoNL)
dataset for instance, the number of synonymous singletons across
unrelated individuals still reflects geographic structure along a
south-north cline^[149]50.
In our framework, we sidestep directly modeling the distribution of
rare variant counts per individual and instead condition on the
observed number of rare variants inherited from each parent using
trio-level data. Given the number of rare variants inherited from each
parent per individual, we then compute the probabilities of comphet
variants landing in high-scoring positions in the same gene across the
cohort. Although the positions where inherited variants land is
influenced in part by direct and background selection and biased gene
conversion, for very rare variants, the effect of these factors is
negligible compared to the effect of the variation in mutation rate
along the genome and the overall gene target size^[150]22,[151]51. We
therefore model the positional distribution of rare inherited variants
using the same Roulette basepair-resolution de novo mutation rate model
leveraged in our de novo recurrence model. Our comphet recurrence
model, called RaMeDiES-CH, relies on the comphet mutational target,
computed for each comphet variant pair and defined similarly as the de
novo mutational target previously introduced. Specifically, the comphet
mutational target is computed as the total squared mutation rate of all
possible variants with higher functionality scores (Fig. [152]3a).
RaMeDiES-CH applies the Cauchy P value combination approach as before
to leverage multiple variant-level functionality scores while
considering exonic and intronic variants, but does not incorporate gene
constraint scores, which do not exist for recessive selection (Methods,
Supplementary Fig. [153]11)^[154]52. RaMeDiES-CH computes
well-calibrated per-gene P values for comphet variants in a cohort
(Supplementary Fig. [155]5).
Fig. 3. Compound heterozygous variants.
[156]Fig. 3
[157]Open in a new tab
a Illustration of the unnormalized squared mutational target computed
for each observed comphet variant in a gene across the cohort
(RaMeDiES-CH, Supplementary Fig. [158]11) or in an individual across
the genome (RaMeDiES-IND, Supplementary Fig. [159]12). “Like” variants
refer to those of the same variant class (i.e., coding SNVs [CS],
coding indels [CI], intronic SNVs [IS], intronic indels [II]) and
within the same functionality score and minor allele frequency
thresholds. b Top ranked genes resulting in the best enrichment
statistic computed for RaMeDiES-IND. Putative candidates refer to genes
that remain candidates for pathogenicity due to their
phenotypically-relevant tissue expression, but where there is not
enough functional evidence or published gene–disease relationships to
establish causality at this time. c Overlap between phenotypes
associated with MED11 and those exhibited by the affected patient. d
RNA-Seq reads from whole blood samples aligned to first two exons and
first intron of MED11 for proband (black), dad (blue), mom (purple) and
two tissue-matched control samples (gray). Thin green line represents
the intron, solid boxes represent protein-coding exonic regions, and
the dotted box represents the 5’ untranslated region of MED11. (e)
Proband exhibits significant retention of the first intron relative to
parents and fifty-three tissue-matched control samples. Intron
retention ratio is calculated as the (median read depth of first
intron) / (number of reads spanning first and second exons + median
read depth of first intron).
Across the set of non-consanguineous UDN families, we did not find
significant recurrent comphet occurrences across genes. This result is
unsurprising, as previous estimates suggest that in panmictic disease
populations, only one deleterious comphet variant is expected for every
five dominant de novos^[160]49. Nevertheless, RaMeDiES-CH represents an
accurate and unbiased statistical test for the recurrence of comphet
variants in human populations, which can be applied to reveal new
diagnoses as sequenced rare disease datasets expand.
We suspected that singleton disease-causing comphet variants were still
present in the cohort. We adapted our statistical framework to compute
an individual-based statistic, RaMeDiES-IND, that normalizes each
observed comphet variant mutational target across all genes in the
genome rather than across all individuals in a cohort (Supplementary
Fig. [161]12). This approach yielded a ranked list of patient–gene
pairs across the UDN cohort, where each patient–gene pair could be
annotated as corresponding to a correct diagnosis or otherwise
(Supplementary Data [162]7). We computed a single enrichment statistic
for this overall patient–gene ranking, which simultaneously suggested a
threshold for clinical consideration of findings, as the best Fisher’s
exact test P achieved across all positions in the list. This enrichment
statistic was significant when compared to the distribution of the same
statistic computed across 10,000 random shuffles of the patient–gene
list (permutation P value = 0.001, Methods, Supplementary
Fig. [163]13). Among the top thirteen hits yielding this best
enrichment statistic, we recapitulated five known diagnoses (i.e.,
NEUROG3, PAH, COX20, NDUFAF8, PRDX3)^[164]53,[165]54 and newly
identified the genomic cause of a known biochemical diagnosis (i.e.,
ACADM in a patient with MCAD deficiency). We also identified comphet
variants in MED11 which are now leading diagnostic candidates in an
undiagnosed patient experiencing neurodegeneration, developmental
delay, brain abnormalities, chorea, and hypotonia (Fig. [166]3c). MED11
is associated with epilepsy and intellectual disability, and this
patient’s presentation could represent a phenotypic expansion of this
known disorder^[167]55. Both inherited variants occur deep in the first
intron of MED11, a region that would be missed by exome-only sequencing
or analysis, and are predicted to cause cryptic splice donor gains.
Transcriptome (RNA) sequencing of blood samples from the affected
patient and both parents highlighted a significantly higher rate of
first intron retention in the affected patient relative to both parents
and to fifty unrelated blood control samples (Fig. [168]3d–e, and
Supplementary Fig. [169]14)^[170]56.
Our comphet models assume that the two inherited alleles comprising a
biallelic variant pair arose as a result of independent mutations.
Homozygous recessive variants, where the exact same allele is inherited
from each parent, are considered alongside compound heterozygous
variant configurations in cases where the allele independence
assumption is unlikely to be violated. In individuals with
consanguineous parentage, however, the same heterozygous variant in
both related parents most likely arose from a single, ancestral
mutational event. As such, we directly exclude probands with evidence
of parental relatedness from our analysis (Supplementary
Note [171]4)^[172]49.
Pathway analysis
Genes involved in the same pathway are frequently involved in similar
phenotypic presentations^[173]57–[174]60. This provides an enticing
possibility of drawing statistical power from multiple independent
occurrences of deleterious variants in the same functional units,
rather than just in the same genes. Moreover, therapeutics for
disorders of the same functional unit that are individually too rare to
meet minimal participant requirements for clinical trials may be
evaluated together within the same umbrella or basket trial for more
efficient approval^[175]61. However, such an approach should be pursued
with caution, as the phenotypes stemming from perturbations of
different genes in the same functional unit may vary to a great extent.
Such differences in patient presentations may render the clinical
evaluations and therapeutic potential of statistically significant
findings virtually impossible. To mitigate this issue, we first
initially consider groups of patients with similar phenotypes, and then
within each of these groups, assess the overrepresentation of
deleterious mutations across established biological pathways
(Fig. [176]4a).
Fig. 4. Biological pathways enriched within phenotypically-similar patient
subgroups.
[177]Fig. 4
[178]Open in a new tab
a Schematic illustrating the two-step process of first clustering
patients according to the semantic similarity of their phenotype terms
and second finding enriched biological pathways among the genes within
each patient cluster. b The most significant pathways per cluster (P
value < 0.01, adjusted for multiple testing using g:Profiler’s
Statistical Correction Scheme^[179]103) with 1+ genes from 1+
undiagnosed patients; complete list in Supplementary Data [180]8. c Two
patients with primarily immune-related symptoms each harbored a
compelling de novo variant in genes involved in immunoproteasome
assembly (POMP) and structure (PSMB8). Their symptoms strongly overlap,
and a subset of these symptoms were also known to be associated with
either gene in OMIM. d Three neurological patients had variants in
transmembrane genes involved in the same pathway. These patients had
substantial phenotypic overlap with each other, as expected, and with
the phenotypes associated with each of their genes (depicted as star
shapes in the upset plot). All icons in (a), (c) and (d) are from
Microsoft PowerPoint.
We start by clustering 2662 affected patients—with or without
sequencing data—into 120 groups (median = 17, min = 2, max = 97
patients per cluster) based on the semantic similarity of their
phenotype terms. Within each cluster, we then combine our de novo
candidates, compound heterozygous candidates and known UDN diagnoses
and perform gene set enrichment analysis (Methods, Supplementary
Data [181]8). We focus our attention on undiagnosed cases with de novo
or compound heterozygous candidates within enriched pathways in each
cluster (Fig. [182]4b). We also report all enriched pathways including
those with only diagnosed patients for potential therapeutic grouping
(Supplementary Data [183]9).
Two of three total candidate genes in one cluster with 19 immunological
disorder patients are both involved in the immunoproteasome complex
(KEGG:03050, n = 46, adjusted P value = 4.42e-3). One patient’s genome
contained a known diagnostic, de novo frameshift variant in POMP, an
immunoproteasome chaperone protein^[184]62. An undiagnosed patient with
evidence of chronic inflammation, recurrent infections, and skin
lesions had a missense de novo in PSMB8, a component of the
immunoproteasome β-ring with overlapping phenotypic associations
(OMIM:256040). Both patients had similar combined immunodeficiency
beyond what was captured in their standardized phenotype terms,
including decreased global antibodies, decreased B cells and natural
killer cells, and retained T cell functionality (Fig. [185]4c).
Disruptions to immunoproteasome assembly and structure have been shown
to lead to an accumulation of precursor intermediates, impaired
proteolytic activity and subsequent uncontrolled inflammation^[186]63.
In another cluster of 15 similarly presenting neurological patients,
three candidate transmembrane genes were represented in the same
functional pathway named for some genes’ known involvement in taste
transduction (KEGG:04742, n = 85, adjusted P value = 7.45e-3). Two of
these genes, CACNA1C and GABRA3, harbored high impact de novo and
hemizygous missense variants respectively, corresponding to known
patient diagnoses^[187]64,[188]65. The genome of another, undiagnosed,
now deceased patient from this cluster with no prior candidate variants
contained a synonymous de novo variant predicted to alter splicing in
another gene in the same functional pathway, HCN4 (Fig. [189]4d). All
three patients exhibited seizures at a young age, speech delays, severe
hypotonia, spasticity and visual impairment. Mouse knockouts of HCN4
demonstrate neurological phenotypes^[190]66,[191]67. In humans, HCN4 is
expressed in the visual and nervous systems and has recently been
associated with infantile epilepsy, suggesting that this patient’s
undiagnosed disorder plausibly represents a phenotypic expansion of
this gene^[192]66,[193]67.
Discussion
In total, we analyze 886 sporadic or suspected recessive cases with
complete trio or quad genome sequencing alongside an additional 463
phenotyped, diagnosed individuals using computational methods to
identify de novo recurrence, compound heterozygosity, and pathway
enrichment. We establish five new diagnoses and three new putative
diagnoses in known disease-causing genes or genes previously unlinked
to these patients’ exact presentations. Our prioritization framework
for pathway analysis further recapitulates 70 known de novo and 10
known comphet diagnoses and suggests 82 de novo and eight comphet
candidates for follow-up (Methods, Supplementary Data [194]8).
In the field of common disease genetics, statistical inference of
disease-associated genomic loci is confidently regarded as primary
evidence for their causality. Rare disease genetics, in contrast, is in
a transition state. Due to a lack of large disease-matched cohorts,
case-based analyses of individual probands relying heavily on detailed
patient phenotyping and clinical intuition have typically been used to
generate candidate variant hypotheses. Evidence required to shift these
variants from uncertain significance to known pathogenic status comes
from experimental, functional studies and by identifying additional,
unrelated, genotype-matched individuals with similar phenotypes through
variant matchmaking services such as MatchMaker
Exchange^[195]68,[196]69. Recently, analyses of large,
broadly-phenotyped cohorts of rare disease patients have demonstrated
the potential for statistical approaches to reveal diagnoses and
generate new gene discoveries in the rare disease space as
well^[197]12,[198]14,[199]70.
Although the genome is a big place, it is also a finite space with
respect to gene regions impacted by simple variants such as SNVs and
short (<10 basepairs) indels. This suggests that, in theory,
recurrence-based statistical methods applied to sufficiently large
sequenced cohorts of rare disease patients, even those with diverse
phenotypic presentations like the UDN, will enable the eventual
discovery of all causes of prenatally viable monogenic disease stemming
from these variant types. In order to take statistical discoveries as
primary evidence, as is the case for common diseases, we need accurate,
well-calibrated statistical methods^[200]71. Even slight model
misspecification may propagate and exacerbate the rate of false
discoveries. The rapid growth of genomic datasets on which these models
may be applied, coupled with an ongoing difficulty in phenotyping
patients at scale to confirm findings^[201]72, further increases the
urgency for more rigorous models.
Here we show that well-calibrated statistical models can be built for
both de novo and compound heterozygous modes of inheritance. Although
novel disease–gene discovery from large, phenotypically- and
genetically-homogenous cohorts has been demonstrated, we show here that
rigorous analysis of a diverse, moderately-sized disease cohort at the
gene and the pathway level shows promise.
Extending our RaMeDiES statistical framework to consider variant types
with less accurate mutation rate models or pathogenicity predictions
may lead to a sacrifice of precision. We therefore acknowledge the
limitations of our statistical approach for comprehensive rare disease
diagnosis efforts using short-read sequencing data, and suggest that
our cohort-level models be applied alongside existing case-based
variant prioritization pipelines for additional
statistically-meaningful findings.
First, although our models integrate non-coding variants with predicted
splice-altering impacts, they do not consider potentially functional
variants within whole genome data that fall into untranslated gene
regions, RNA-coding genes or between genes, as genome-wide tracks of
verifiable deleteriousness scores do not yet exist for these variant
types. Improvements to and precomputed scores for these variants will
be beneficial for interpretation efforts in general and can be
leveraged in future iterations of RaMeDiES. Our statistical analysis
also does not consider structural, large indel, copy number, or tandem
repeat variants, as their identification from short-read sequencing
data is computationally expensive and often inaccurate. Ongoing efforts
to generate long-read sequencing data within the UDN and elsewhere
should enable improved identification and analysis of pathogenic
complex variants^[202]73,[203]74. Developing a statistical model for
these variants will still require accurate mutation rate estimates for
these variant types, which is lacking. GnomAD-SV represents a promising
iteration toward this goal, but is still highly dependent on their
specific variant calling pipeline and data rather than biological
mutagenic processes^[204]75.
The presented method considers only autosomal de novo and compound
heterozygous inheritance patterns due to complications in modeling
other disease-relevant inheritance patterns. First, it is difficult to
propose a statistical model for biallelic variant counts in
consanguineous and founder populations, because these counts strongly
depend on the ancestral population history and inbreeding patterns. A
more appropriate statistical approach for assessing recurrence of these
variants would be the extension of parametric linkage applied to very
large cohorts^[205]76. Second, inclusion of hemizygous or other
X-chromosome variants requires accurate sex-chromosome variant calling,
which is notoriously error prone, as well as an accurate mutational
model of the X chromosome, which is complicated due to sex-dependent
selection and random X-inactivation^[206]77. Finally, although we do
not model parental mosaicism or uniparental disomy in our recurrence
statistics, these inheritance patterns and events are regularly
assessed via complementary, traditional case-based approaches^[207]10.
Even though genomic sequencing has been liberalized, currently many
analyses are still restricted to individual programs, and regulatory
and technical barriers prevent sharing individual-level variant data
broadly. In contrast, there are avenues for sharing some variant-level
data in a way that is easily accessible to clinical geneticists.
MatchMaker Exchange, for instance, enables the sharing of specific
variants prioritized through case-based analyses with the goal of
finding new genotype- and phenotype-matched patients. Broadening the
success of MatchMaker Exchange to include variants that may not have
risen to the level of strong candidates in analyses of individual
patients is desirable. We developed a browser containing our gene-level
findings and variant-level information about rare genetic variation in
UDN patients ([208]https://dbmi-bgm.github.io/udn-browser/). In
addition, we provide an open-source software package, RaMeDiES,
implementing the efficient and well-calibrated statistics for de novo
recurrence and deleterious compound heterozygous inference proposed
here. RaMeDiES’ operation on shareable summary statistics rather than
on variant-level data enables automated, deidentified cross-analysis of
substantial existing yet siloed sequenced cohorts for new diagnostic
discoveries. As the Mendelian genomics field continues the transition
to this new data science phase, the methods we present here should
facilitate the exciting prospect of international cross-cohort
analyses, resulting in new findings and a vastly improved rare disease
diagnostic rate globally.
Methods
This work was performed in accordance with all ethical guidelines
outlined in the NIH IRB #15HG0130. The study proposal and manuscript
were approved by the UDN Publications and Research Committee. All
participants have provided written, informed consent for the sharing
and use of their data in this study.
Undiagnosed diseases network (UDN) structure
The Undiagnosed Diseases Network (UDN) was established in 2014 with the
goal of uncovering clinical diagnoses and novel disease-causing genetic
variants and their molecular functionalities. In its current phase, the
UDN is composed of 12 clinical research centers across the United
States and a CLIA-certified sequencing core at Baylor Genetics. Typical
UDN patients have already endured a multiyear “diagnostic odyssey” of
extensive prior testing by multiple medical specialists and often
inconclusive targeted, whole exome and even whole genome sequencing at
the time of their application to the UDN.
As part of the application process, a team of clinicians and genetic
counselors at one of the UDN clinical sites reviews the patient’s
medical records, referral letters and lab data and creates an
abstracted case review document. If the team concludes that a UDN
evaluation may aid in the identification of a diagnosis, the patient is
accepted to the program and undergoes a thorough in-person evaluation
at their assigned clinical site. Most patients and available affected
and unaffected family members receive whole genome sequencing (GS) as
well. All genomic sequencing data, clinical sequencing reports prepared
in accordance with the American College of Medical Genetics and
Genomics (ACMG) variant classification guidelines, structured
phenotyping in the form of Human Phenotype Ontology (HPO) terms, lab
results, imaging data, medication data, referral letters and clinical
notes, the abstracted case review document, and candidate variants and
genes are uploaded to the UDN Data Management and Coordinating Center.
All patients enrolled in the UDN have consented to the broad sharing of
all their genomic, phenotypic and clinical data with researchers
network-wide for use in research projects and when evaluating
gene–phenotype fit for a specific patient and candidate gene. Moreover,
UDN patients have consented to follow-up if additional tests or
information are deemed useful.
Harmonization of whole genome sequencing data
Short-read whole genome sequencing was performed between 2014 and 2022
in accordance with the UDN Manual of Operations, which specifies that
the average coverage across the genome must be >40X, and >97.5% of all
coding and noncoding genes (UTRs, coding regions, and intronic regions)
must be covered at >20X. Paired-end FASTQs were retrieved in June 2022
for 4268 samples collected from 4236 unique individuals. Six
individuals subsequently dropped out of the UDN program and are
excluded from the analyses presented here. All FASTQ pairs were within
expected parameters
([209]https://www.bioinformatics.babraham.ac.uk/projects/fastqc/,
version 08-01-19) and were aligned to human reference hg38 (with decoys
and all alt contigs) using the Sentieon^[210]78 bwa-mem implementation
via the Clinical Genome Analysis Pipeline (CGAP,
[211]https://cgap.hms.harvard.edu/, version 29cefcec). Read groups were
added via a custom CGAP script, multiple FASTQ pairs corresponding to
the same sample were merged, and resulting BAMs were sorted, deduped,
and recalibrated using a Sentieon implementation. GVCFs were produced
using CGAP’s implementation of GATK’s HaplotypeCaller. All processing
steps from FASTQ to GVCF were deployed on spot instances in Amazon Web
Services (AWS). GVCFs were then egressed to the Harvard Medical School
institutional cluster. SNVs/indels were jointly called across genomic
shards then merged using Sentieon tools (version 202112.02). Per-sample
sex and cross-sample relatedness were confirmed using Somalier (version
0.2.15, Supplementary Fig. [212]1)^[213]79. We required that all trios
under consideration in our analysis had two parents reported as
“unaffected”, a child reported as “affected”, parent–child relatedness
0.5 ± 0.075, parent–parent relatedness <0.15, mothers had heterozygous
variants present and a scaled mean depth of ~2 on chromosome X, and
fathers had a scaled mean depth of ~1 on chromosome Y. All variants
were annotated using Ensembl VEP (version 108) and slivar (version
0.2.7) for TOPMed and per-population gnomAD (versions v2.1.1 and
v3.1.2) variant frequencies and homozygote counts^[214]80,[215]81.
For our compound heterozygote analysis, we inferred within-family
regions of identity by descent (IBD) using KING (version
2.3.0)^[216]82. We required at least one IBD region between the child
and each parent to further confirm relatedness (in addition to kinship
coefficient filtering) and no IBD regions of length > 3 Mb between
parents to confirm non-consanguinity between parents. In families with
multiple affected siblings, we select one sibling as the proband and
disregard the other siblings during initial analyses. Variants in other
affected siblings were then used to check segregation during validation
of our findings. This process resulted in 846 non-consanguineous trios
with an affected child and two unaffected parents for our analyses. We
chose to stringently exclude individuals with evidence of familial
consanguinity (i.e., by imposing a parental relatedness and IBD region
length constraints) rather than excluding patients based on their
relative recessive burden because an assumption of our statistical
models is violated in consanguineous cases (Supplementary Note [217]4).
Clinical evaluation framework
Protocol overview
We developed a clinical analysis protocol to reduce subjectivity in the
assessment of diagnostic candidates. We used the case evaluation
process implemented at Brigham Genomic Medicine as a
foundation^[218]83. We then transformed this process into a systematic
and structured protocol with inspiration from the gene–disease
association criteria developed by the Clinical Genome Resource
(ClinGen) group^[219]17,[220]68. Evidence in support of or against a
candidate variant–participant match was evaluated by a team of clinical
geneticists according to three categories for experimental evidence not
taken into account by our statistical analyses: (i) model organism or
cell line studies, (ii) tissue expression, and (iii) protein molecular
function. Clinicians also took into account case-level data and
published literature with case-control data including (iv) known
disease associations, (v) gene evolutionary constraint, and (vi)
variant pathogenicity. Discrepancies in opinion were mediated by joint
discussion until a consensus decision was reached. A detailed
description of the protocol and scoring scheme is available in
Supplementary Note [221]2 and hierarchical decision trees to streamline
the scoring process are provided in Supplementary Fig. [222]2.
Clinical score calibration
We ensured that the protocol was specific and detailed to the extent
that different clinicians with access to the same patient data would
independently assign equal clinical scores to the same candidates. Over
the course of two months, at least two clinicians each evaluated 2–3
compound heterozygous candidates per week and independently recorded
their notes, final clinical scores, and score rationale in a REDCap
database^[223]84. At weekly joint discussions, they iteratively updated
the protocol to improve specificity and reduce discrepancies in
scoring. The final two joint discussions confirmed that categorical and
final clinical scores assigned by different clinicians were
consistently in agreement.
Validation
The clinical team was provided with ten “candidates” and ten “decoys”
from real UDN patients in random order to evaluate. The team was
blinded to gene labels, variant inheritance and SpliceAI score
information during evaluation. “Candidate” genes had two rare variants
(gnomAD popmax AF < 0.001) inherited in trans where one variant was
exonic with CADD > 23 and the second variant was intronic with a max
SpliceAI > 0.3. “Decoy” genes were selected with identical criteria
except that variants were actually inherited in cis or the intronic
variant had a maximum SpliceAI score of 0. After assigning final
clinical scores to each of the 20 genes, the candidate/decoy labels
were revealed to the clinical team (Supplementary Data [224]1).
Identification of de novo variants
For each of the 1463 sequenced trios in our harmonized UDN dataset,
including trios with unaffected offspring, we select the subset of
variants with read depth >10 and genotype quality (GQ) >20 across
proband, mother and father. We further subset to variants with a “high”
Roulette quality score, gnomAD population maximum allele frequency
<0.01, TOPMed^[225]85 allele frequency < 0.01, proband alternate allele
read depth >4 and frequency > 0.2, and alternate read depth <2 in both
parents.
We then utilize observed aligned reads across each trio and across
thirty unrelated individuals to assign posterior probabilities to each
putative de novo variant on autosomes using the CGAP reimplementation
of novoCaller ([226]https://cgap.hms.harvard.edu/)^[227]16. We consider
all de novos with a novoCaller posterior probability >0.7 to be high
confidence, noting that thresholding the novoCaller posterior
probability from 0.5 to 0.95 has negligible impact on the number of
passing variants overall and per-proband (Supplementary Fig. [228]6a).
We further exclude probands with over 150 high confidence de novo
calls, as these patients frequently had “suspected parental mosaicism”
mentioned in their clinical records. Finally, because clonal sperm
mosaicism may lead to siblings inheriting identical de novo variants,
we exclude duplicate de novo variants within each family from
downstream recurrence analyses^[229]86. This process resulted in 1072
trios with an affected proband and unaffected parents for further
analysis.
Analytical test for de novo cohort-level recurrence
Basic statistic definition
We define a cohort as a set of N genomes (i.e., collections of genes)
each with sets of de novo variants arising independently but based on
the same background de novo mutation rate. Let
[MATH: μi
:MATH]
denote the de novo mutation rate of a specific variant i. The
mutational target of a gene g is
[MATH:
μg=∑i∈g
μi.
:MATH]
1
The mutational target of a variant
[MATH: v :MATH]
in gene g is
[MATH:
μg,v=∑i∈g
μi1sc
orei≥score
v :MATH]
2
where
[MATH:
scorei :MATH]
is the deleteriousness score of variant i. Intuitively, the more
surprising and/or deleterious a variant, the smaller its mutational
target. By definition, variant mutational targets are uniformly
distributed from 0 to
[MATH: μg
:MATH]
, so
[MATH:
μg,vμg~U
mrow>[0,1] :MATH]
3
ppose there are K de novo variants falling within gene g across the
cohort, where
[MATH: K≥1 :MATH]
. We define a statistic y as
[MATH: y=∑i=1K1−μg,viμ
g
:MATH]
4
Note that Y is a sum of K uniformly distributed variables on [0,1]
under the null. The distribution of Y given parameter K can thus be
modeled by the Irwin-Hall (IH) “sum of uniforms'' distribution, which
has a closed form for its cumulative density function (CDF) and thus
also for its survival function (SF), where
[MATH: SF=
1−CDF :MATH]
^[230]87. This enables us to replace permutation-based significance
evaluations and instead analytically compute the probability of
achieving a Y as high or higher than observed with K variants using the
IH survival function as
[MATH:
Pr(y∣K)=IHSF(y<
/mi>∣K) :MATH]
. We note that there are many other constructions over a set of
uniformly-distributed random variables (such as P
values)^[231]88,[232]89. We further note that as the cohort size
dramatically increases, the Irwin-Hall distribution can be replaced
with the normal distribution.
Finally, we also modelPr(K), the probability of K independent de novo
variants to land in gene g given this cohort of size N, to assign an
overall significance value to our statistic y as
[MATH: Pr(y)=∑K=1∞Pr(y∣K)
Pr(K)
. :MATH]
5
Because neither y nor Pr(K) are defined for
[MATH: K= 0 :MATH]
, we do not expect Pr( y) to be uniformly distributed. Instead, only
[MATH: Pr(y∣K≥1)=P
r(y)/Pr(K≥1) :MATH]
is expected to be uniformly distributed (Supplementary Fig. [233]9).
In a single genome with
[MATH: n :MATH]
total observed de novo variants, the number of de novo variants to land
in a particular gene g, given that
[MATH:
μg≪1 :MATH]
, is Poisson distributed, parameterized by the expected number of de
novos
[MATH:
λ=nμg<
/mi> :MATH]
. In a cohort of N genomes, the number of de novo variants to land in
gene g is therefore a sum of N Poisson-distributed random variables,
which itself is also Poisson distributed. We thus compute
[MATH:
Pr(K)
mrow>=Pois(
K∣λ) :MATH]
, where λ is given by
[MATH: E[K]≡λ=μg∑j=1Nnj. :MATH]
6
Different deleteriousness scores for coding and intronic variants
We use continuous, per-variant deleteriousness scores that are
precomputed and publicly-available for all possible variants
genome-wide in our computations. Precomputed scores are required for
the calculation of comprehensive, basepair-resolution mutational
targets as described above. For missense variants, we interchangeably
use AlphaMissense (version hg38 released with their 2023 publication),
PrimateAI-3D (academic license, accessed May 2024), CADD (version 1.6),
and REVEL (accessed May 2024).23,24,30,31 CADD is also used for scoring
all other exonic variants, including nonsense and indel variants. For
intronic variants, we use SpliceAI (academic license, accessed May
2021).41 We use different variant functionality scores for exonic and
intronic variants because we found that these values are poorly
correlated with each other in intronic space (Supplementary
Fig. [234]15). Clinical sequencing centers also regularly report these
scores, suggesting their relevance in rare disorders^[235]68,[236]90.
Different mutation rate models for SNV and indel variants
We use Roulette de novo mutation rates for SNVs genome-wide. Different
mutational processes lead to indel mutations, so Roulette values cannot
necessarily be adapted to model this mutation type.32 We approximated
per-gene joint distributions of indel mutation rates and
deleteriousness scores as follows. First, we considered all possible
exonic indels of length ≤10nt for which precomputed CADD scores were
available for download and all possible intronic insertions of length
1nt and deletions of length ≤4nt for which precomputed SpliceAI scores
were available for download. Although SpliceAI provides predictions
exhaustively for all possible indels, CADD provides scores for the
subset of indels observed in gnomAD-v2. We excluded all indels that
overlapped with any SNVs assigned a Roulette “low quality'' filter,
which are based on gnomAD quality metrics, abnormal density of
segregating sites, and suspicious patterns of recurrence. We further
excluded indels with a gnomAD popmax MAF >0.1% and/or a number of
alleles in gnomAD (AN) in the bottom decile. For exonic and intronic
variants separately, we binned all indels by their precomputed CADD or
SpliceAI score rounded to the nearest hundredth. The total number of
indels within a deleteriousness score bin and all bins corresponding to
higher deleteriousness scores was used as an approximation to the
mutational target associated with that score.
Incorporation of different variant types
Because there are different deleteriousness scores for coding and
intronic variants and different mutational targets for SNV and indel
variants, we expand our basic test statistic to accommodate different
variant types
[MATH: t∈ :MATH]
{coding SNV, coding indel, intronic SNV, intronic indel}. We redefine a
gene and variant mutational target with respect to each variant type as
[MATH:
μg,t=∑i∈g,
mo>tμi :MATH]
7
and
[MATH:
μg,v,<
/mo>t=∑i∈g,
mo>tμi1sc
orei≥score
v :MATH]
8
where g,t refers to the subset of all possible variants in gene g of
type t. We define y' as
[MATH:
y′=∑t∑i=1
K1−μg,vi,tμg,t<
/msub>~IHy′∑t<
mi>Kt :MATH]
9
where K[t] is the number of observed de novo mutations of variant type
t landing in gene g. The expected number of de novos to land in gene g
when considering different variant types is
[MATH: E[K]≡λ′
mo>=∑t<
mi>μg,t∑j=1Nnj,t
msub> :MATH]
10
where
[MATH:
nj,t :MATH]
denotes the total number of observed de novo variants of type t in an
individual j. For each variant type t, we scale
[MATH:
μg,t :MATH]
such that
[MATH:
∑g
μg,t= 1 :MATH]
. We compute
[MATH:
Pr(K)
mrow>=Pois(
K∣λ′)
:MATH]
.
Cauchy-combination of P values computed with different deleteriousness
predictors
We can run our method using different deleteriousness score predictions
for coding SNVs (i.e., AlphaMissense, PrimateAI-3D, CADD, or REVEL),
resulting in slightly different lists of genes with corresponding P
values when incorporating this variant type. We combine these lists
using the Cauchy combination test, an analytic calculation that is
applicable under arbitrary dependence structures^[237]89.
Incorporation of GeneBayes values
We incorporate GeneBayes values, which estimate the selection against
heterozygous protein-truncating variants per gene, as weights in a
weighted false discovery rate (FDR) procedure^[238]26,[239]91. We sort
all genes in ascending order by their GeneBayes values. We then
separate these sorted genes into 10 equally sized decile bins. For each
gene g in each bin b[g], we compute a weight
[MATH: wg
:MATH]
as
[MATH:
wg=10⋅∣DD∈b[g]
∣∣DD∣ :MATH]
11
where DD is the set of exclusively dominant disease-causing genes as
annotated in OMIM (accessed December 2023). Genes without GeneBayes
values are assigned a weight
[MATH: wg= 1 :MATH]
. Note that
[MATH:
E[w
g]=1 :MATH]
and that GeneBayes values, which are constant for all variants within a
given gene, are independent from y and y' values, which vary for
variants within a gene based on variant mutational targets and
deleteriousness scores. This enables us to perform Benjamini-Hochberg
false discovery rate correction on weighted Q-values computed for each
Pr(y') as
[MATH:
Q=Pr(y′)/wg :MATH]
.26
Massively parallel splicing reporter assay (MPSA)
Assay design
We designed oligonucleotides to evaluate the impact of a variant
predicted to cause a cryptic splice site gain or a canonical splice
site loss. For each variant with a predicted splice-altering impact, we
extracted the surrounding genomic sequence from the UDN patient
harboring the variant (alternate) as well as a paired version with the
variant of interest replaced with the reference allele (reference). We
centered the candidate sequence on the variant of interest, noting that
the impacted splice site junction could be up to 50 nucleotides away
from the variant. For a subset of variants, we also generated candidate
sequences that were centered on the predicted site of the altered
splice junction rather than on the variant itself. We embedded each
candidate sequence in an oligonucleotide template containing a 4-nt
barcode and flanking primers as follows:
Splice donor library structure
GCACGGACAAAGTACTAGCC [155-nt candidate sequence][4-nt SD-associated
barcode] GGAAGATCGACGCAGgtaagt
Splice acceptor library structure
TGCTCTTATGCGAACGTGTTAAC [4-nt SA-associated barcode] [151-nt candidate
sequence] GGAAGATCGACGCAGgtaagtt
The final oligonucleotide library contained 6000 200-nt
oligonucleotides, encompassing 1920 alternate/reference pairs, which we
ordered from Twist Bioscience.
Library cloning and experimental protocol
The oligonucleotide library was cloned separately using PCR
amplification and NEBuilder assembly into lentiviral splice acceptor
(pLenti-MPSA-acceptor) and splice donor (pLenti-MPSA-donor) vectors.
These vectors consisted of an EF1A promoter and an mCherry open reading
frame (ORF) followed by splicing reporter modules based off of prior
published massively parallel splicing reporter
constructs^[240]92,[241]93 (Supplementary Fig. [242]7) as well as a
separate Puromycin selection cassette. Plasmids have been deposited to
Addgene under accession numbers 240805 (“plentiMPSA SAentry PuroR'')
for splice acceptors and 240806 (“plentiMPSA SDentry PuroR'') for
splice donors.
Lentiviral particles for each library were produced and titrated. Each
library was transduced at a multiplicity of infection (MOI) of 0.3 in
three biological replicates into 6.25*106 cells/replicate of HepG2
(liver cell line derived from primary cells extracted from a white,
male, 15-year-old with liver cancer, catalog #HB-8065) and SK-N-SH
(neural-like cell line derived from primary cells extracted from a
female 4-year-old, catalog #HTB-11), both acquired from American Type
Culture Collection (ATCC). Cells were routinely tested for mycoplasma
contamination via qPCR, and all tests were consistently negative for
mycoplasma. Cells were selected with Puromycin to completion, and
genomic DNA and RNA were harvested one week after transduction.
PCR-based nextgen sequencing (NGS) library preparation was performed on
all 12 genomic DNA and RNA samples. Libraries were sequenced with 75-nt
paired-end reads using an Illumina NextSeq 500 sequencer, ensuring an
average of >1000 reads per library member from all libraries.
Barcode mapping
Over ~75% of all RNA reads could be mapped back to a 15-nt barcode
found in our starting dictionary. This resulted in ~6—15 million mapped
RNA reads per MPSA replicate, yielding a median of 1170 mapped reads
per alternate/reference library pair per replicate. Results from
Tapestation, an automated electrophoresis system for sizing and
quantifying nucleic acid samples, showed that 49.6% of mapped reads
from splice donor MPSA experiments utilized some library splice donor
site and 50.4% utilized the experimentally fixed site. Across splice
acceptor MPSA experiments, 58.3% of mapped reads utilized some library
splice acceptor site and 41.7% utilized the fixed site.
MPSA validation rate
We considered all alternate/reference library pairs with at least 10
barcode-disambiguated mapped reads each in one or more MPSA
experiments; 99.4% of pairs met this requirement. Each read was then
categorized as (1) using the experimentally fixed splice site, (2)
using a splice site corresponding to a known intron/exon junction as
annotated in Ensembl, (3) using the SpliceAI-predicted cryptic splice
gain site, (4) using a cryptic splice site at a different location, (5)
malformed where the read did not begin with the correct fixed sequence
due to a next-generation sequencing error, or (6) recombined where the
read did not align to the expected oligo sequence at all. The percent
of malformed and recombined reads per alternate/reference pair was 7.5%
(SD=1.9%) and 6.2% (SD=10.6%) respectively on average. The position of
SpliceAI-predicted cryptic splice sites often did not correspond to the
expected splice junction based on manual inspection or to the splice
sites observed in MPSA experiments (55.4% of splice acceptor and 5.4%
of splice donor predicted positions matched). We instead considered the
most common cryptic splice site position observed in each alternate
library sequence to be the predicted site. MPSA validation rate is
computed per alternate/reference library pair as the difference in
percentages of total reads supporting the predicted cryptic splice site
between oligos containing the alternate variant and the corresponding
reference oligonucleotides (Supplementary Fig. [243]8a).
We compared the MPSA validation rates across the three biological
replicates and two cell types using Pearson's correlation
(Supplementary Fig. [244]8b).
DeNovoWEST gene-specific enrichment of de novo variants
We modified the DeNovoWEST weighted permutation test by first
augmenting the set of variants under consideration beyond exonic
variants to include all possible intronic variants in protein-coding
genes with a SpliceAI score >0.4, resulting in ~400k additional
possible variants under consideration.41 To this end, we modified the
codebase (version v1.0.0) to consider these intronic putatively
splice-altering variants to have the same functional consequence as
canonical splice site variants if they had a VEP annotation of
“splice_acceptor'' or “splice_donor'' or the same functional
consequence as missense variants otherwise. We then updated the
required precomputed values, including per-variant mutation rates,
minor allele frequencies, deleteriousness scores and per-region
constraint values as detailed below, for all exonic and intronic
variants under consideration (Supplementary Fig. [245]9a). The
underlying triplet-context mutational model was replaced with
genome-wide, per-SNV Roulette mutation rate estimates^[246]94. Each
variant's minor allele frequency was set to the maximum gnomAD-v3
population or TOPMed allele frequency. Per-variant Phred-scaled and
unscaled CADD values were obtained from
([247]https://cadd.gs.washington.edu/) (version 1.6 for GRCh38/hg38).
Updated per-gene shet values were obtained from
([248]http://genetics.bwh.harvard.edu/genescores/selection.html) and
binned into a “low'' category if mean shet was below 0.15 and a “high''
category otherwise.27 Notably, some stable Ensembl gene IDs in
GRCh37/hg19 are not present in GRCh38/hg38 and vice versa; all variants
from the 894 GRCh38/hg38 genes without shet values are binned into the
“low'' category. Regional missense constraint values, defined for
adjacent windows covering the full genomic region of each
protein-coding gene were obtained from
([249]https://gnomad.broadinstitute.org/downloads#exac-regional-missens
e-constraint). We translated these genomic region coordinates from
GRCh37/hg19 to GRCh38/hg38 using UCSC's LiftOver tool and then assigned
a constraint value to exonic and intronic variants corresponding to the
genomic region they fell into. We recomputed the weights assigned to
each variant type using the union of all de novo variants in our cohort
and the de novo variants released with DeNovoWEST (encompassing ~31,000
exome-only trios), because the distribution of de novo variant classes
in UDN data was very similar to the distribution of de novo variant
classes in the dataset used by DeNovoWEST (Supplementary
Fig. [250]9b-c) and because the authors warn that weights generated
from smaller datasets alone may be unreliable. Gene severity scores
were then computed for every gene harboring one or more de novo
variants across our cohort. We adjust DeNovoWEST assigned p-values
using Bonferroni correction for twice the total number of genes
evaluated as suggested by the authors. We find that DeNovoWEST and
RaMeDiES-DN (using only CADD in exonic regions as a closer comparison
to DeNovoWEST) recovered known autosomal dominant disease genes at a
comparable rate across de novo variants provided in the original
DeNovoWEST paper (Supplementary Fig. [251]16).
Analytical test for compound heterozygous cohort-level recurrence
A compound heterozygous configuration is an independent occurrence of
two variants: one maternally (M) and the other paternally (D)
inherited. Here we use the term “compound heterozygous configuration''
to refer to either two distinct heterozygous variants inherited from
either parent, or a single homozygous recessive variant where both
independently-inherited alleles arose from independent mutational
events (Supplementary Note [252]4). The mutational target of a compound
heterozygous configuration should therefore lie in a space of squared
mutational targets. We define the mutational target of a compound
heterozygous configuration as
[MATH:
μg,
vM,v<
/mi>D=ma
x(μg,vM
mrow>,μ
g,vD)2
mrow> :MATH]
12
where
[MATH: vM
:MATH]
and
[MATH: vD
:MATH]
are maternally and paternally inherited variants comprising a compound
heterozygous configuration, and
[MATH:
μg,
vM :MATH]
and
[MATH:
μg,
vD :MATH]
are computed as in Eq. [253]12. To prioritize compound heterozygous
configurations with both deleterious variants, we use the maximum over
per-variant mutational targets. A single deleterious variant in a
compound heterozygous configuration may indicate carrier status rather
than a compelling candidate for a rare disorder. By this definition,
[MATH:
μg,
vM,v<
/mi>D :MATH]
is uniformly distributed at null (Supplementary Note [254]4). This
enables us to define a similarly constructed statistic
[MATH: yc
:MATH]
modelable by the Irwin-Hall distribution as in the case of recurrent de
novos (Eq. [255]4):
[MATH:
yc=∑j=1
K1−μ<
/mrow>g,v
M,i,vD,iμg
2~IHyc∣K :MATH]
13
where K is the number of compound heterozygous configurations
independently landing in gene g across the cohort, and
[MATH:
vM,j :MATH]
and
[MATH:
vD,j :MATH]
are the maternally and paternally inherited variants in gene g in
individual j. As before, K is approximately Poisson distributed, and
parameter
[MATH: λc
:MATH]
, the expected number of compound heterozygous configurations to land
in gene g, is given by
[MATH: E[K]≡λc
mi>=μg2∑j=1NnM,j
msub>nD,j<
/mrow> :MATH]
14
where
[MATH:
nM,j :MATH]
and
[MATH:
nD,j :MATH]
are the numbers of maternally and paternally inherited rare variants in
an individual j, respectively. We compute
[MATH:
Pr(K)
mrow>=Pois(
K∣λc
) :MATH]
as before.
Finally, we extend this basic test statistic to accommodate 16 compound
heterozygous configuration types as
[MATH:
(tM,tD)∈ :MATH]
{coding SNV, coding indel, intronic SNV, intronic indel}2 and define
[MATH:
yc<
mo>′ :MATH]
and Poisson parameter
[MATH:
λc<
mo>′ :MATH]
accordingly as
[MATH:
yc<
mo>′=∑tM,t
D∑i=1K<
mrow>tM,
mo>tD
msub>1−μg,vM,i,t
mi>M,vD,i,tD<
/mrow>max(μg,tM,μg,<
/mo>tD<
/msub>)2 :MATH]
15
and
[MATH: E[K]≡λ<
mrow>c′=
∑tM,t
Dμg,tM<
/msub>μg,<
/mo>tD<
/msub>∑j=1NnM,t
mi>M,jnD,<
mi>tD,j :MATH]
16
where
[MATH:
Kt<
mi>M,tD :MATH]
is the number of compound heterozygous configurations in gene g across
the cohort where the maternally inherited variant is of type
[MATH: tM
:MATH]
and the paternally inherited variant is of type
[MATH: tD
:MATH]
. Instances where
[MATH:
Kt<
mi>M,tD= 0
:MATH]
are excluded from the above sums. For each variant type t, we scale
[MATH:
μg,t :MATH]
such that
[MATH:
∑g
μg,t= 1 :MATH]
. We compute the probability of
[MATH:
yc<
mo>′ :MATH]
as in Eq. [256]5. Note that inherited homozygous recessive variants
that arose from a single, ancestral mutational event violate the
assumptions of our approach and are excluded (Supplementary
Note [257]4).
Modeling false positive diagnoses
For any gene where the observed number of variants
[MATH: K>E[K] :MATH]
across the cohort, we suspect that there are some true diagnoses in
specific patients as well as some “false positives'' where a randomly
occurring variant in a patient is unrelated to the patient's condition.
We use the binomial distribution parameterized by K independent trials
and probability of success per trial
[MATH: E[K]/N :MATH]
to estimate the proportion of false positive diagnoses for each gene.
Analytical test for individual-level compound heterozygous configuration
Given a set of independent compound heterozygous configurations across
genes in a single individual's genome, we construct a test for the
hypothesis of a monogenic, recessive disorder caused by one of these
compound heterozygous configurations against the null. We assume up to
one compound heterozygous configuration per gene, i.e., for each gene
g,
[MATH:
nM
nD
μg2<
/mrow>≪1 :MATH]
, where
[MATH:
∑g
μg= 1 :MATH]
and
[MATH: nM
:MATH]
and
[MATH: nD
:MATH]
are the numbers of maternally and paternally inherited rare variants in
this individual's genome.
We now rescale the mutational target of a compound heterozygous
configuration (Eq. [258]12) with respect to all genes g in the genome
as
[MATH: μ~g,vM<
/mi>,vD=∑i∈G
min(μ
i2,
μg,
vM,v<
/mi>D)∑i∈G
μi
2. :MATH]
17
Intuitively, this corresponds to the probability of observing a
compound heterozygous configuration with an equal or smaller (i.e.,
more surprising) mutational target occurring in any gene in the genome.
Thus,
[MATH: μ~~U[0,1] :MATH]
. We precompute each gene's compound heterozygous mutational target
[MATH:
μi
2 :MATH]
for all genes in the genome in order to quickly compute
[MATH: μ~
:MATH]
values for each observed compound heterozygous configuration in an
individual.
Next, we define our statistic
[MATH: y~c :MATH]
per individual as the minimal observed rescaled compound heterozygous
mutational target:
[MATH: y~c=min
mi>(μ~1,...
mo>,μ~K) :MATH]
18
We compute the probability of observing a
[MATH: y~c :MATH]
value this low or lower given K total genes with observed compound
heterozygous configurations in an individual's genome as
[MATH: Pr(y~c∣K)
=1−∏i=1KPr(Y>
y~c)=1−Pr(
Y>y~c)
K. :MATH]
19
where Y is a dummy variable. Because
[MATH: y~c :MATH]
is uniformly distributed on [0,1],
[MATH:
Pr(Y>y~c)=(1−y~c) :MATH]
, so we simplify this calculation as
[MATH: Pr(y~c∣K)
=1−(
1−y~c)
K :MATH]
20
We also model the distribution of K observed compound heterozygous
configurations across an individual's genome in order to compute the
overall probability of our statistic
[MATH: y~c :MATH]
using the same formulation as before (Eq. [259]5). The distribution of
K, given our prior assumption of at most one compound heterozygous
configuration per gene, has an exact solution as the number of double
events in a bivariate binomial distribution with correlation parameter
[MATH: ρ :MATH]
capturing the effect of different gene lengths on K. However, due to
the complexity in calculations of the exact solution, here we use the
Poisson approximation instead because, for each gene g,
[MATH:
∑g
μg= 1 :MATH]
and
[MATH:
nM
nD
μg2<
/mrow>≪1 :MATH]
. The
[MATH: λ~c :MATH]
parameter for the Poisson approximation in this case is
[MATH: E[K]≡λ~c=nMn
D∑i=1
Gμi2
:MATH]
21
Finally, we accommodate the 16 compound heterozygous configuration
types as
[MATH:
(tM,tD)∈ :MATH]
{coding SNV, coding indel, intronic SNV, intronic indel}2 and redefine
[MATH: μ~′ :MATH]
,
[MATH: y~c′
mrow> :MATH]
and Poisson parameter
[MATH: λ~c′
mrow> :MATH]
accordingly as
[MATH: μ~g,vM<
/mi>,vD′=∑tM,t
D∑i∈G
min(μi,tM<
/mi>μ
i,tD,μg,vM,tM,vD,tD)∑tM,t
D∑i∈G
μi,tM<
mrow>μi,t<
/mrow>D
:MATH]
22
and
[MATH: y~c′<
mo>=min(μ~1′,.
..,μ~K′)
:MATH]
23
and
[MATH: E[K]≡λ~c′
mrow>=∑tM,t
DnM,tM<
/msub>nD,<
/mo>tD<
/msub>∑i∈G
μi,tM<
mrow>μi,t<
/mrow>D :MATH]
24
Enrichment for correct diagnoses
Given a ranked list of genes across a cohort of patients, where each
gene may be diagnostic for the given patient, we can compute enrichment
for correct diagnoses at each gene rank. We use Fisher's exact test to
compare the proportion of complete, certain diagnoses in all genes up
to and including rank k compared to the proportion of correct diagnoses
at genes ranked k+1 through the end of the list. We consider the
minimum Fisher's exact test P across all k to be our overall
enrichment. We assign a permutation-based P value to this enrichment
value by randomly permuting the initial gene list 10,000 times and
recomputing the minimum Fisher's exact test P for each permuted list.
Transcriptome sequencing analysis for MED11
RNA extraction, sequencing and quality control
RNA was extracted from UDN patients' whole blood samples received at
UCLA between 2018 and 2019 using PAXgene Blood RNA extraction kits from
Qiagen. Concentration of RNA in each sample was quantified using the
Qubit 3.0 Fluorometer. RNA integrity numbers (RINs), a quality control
measure, were assessed per sample using the Agilent bioanalyzer. RNA
libraries were prepared for each sample using either the NuGEN
Universal Plus mRNA-Seq kit or the Illumina TruSeq mRNA + Globin Minus
kit. Sequencing was then performed on the Illumina NovaSeq 6000 to
generate ~50-100 million 100-150bp paired-end reads per sample. Library
preparation and sequencing were performed at the UCLA Neuroscience
Genomics Core and the UCLA Technology Center for Genomics and
Bioinformatics Core. Sequenced reads in FASTQ format were aligned to
human reference genome GRCh37 using STAR v2.5.2b with default
parameters and Gencode v19 annotations^[260]95,[261]96. To increase
sensitivity to novel splice junctions, reads were mapped using the STAR
2-pass mode, where novel splice junctions detected during the first
pass alignment are indexed and used alongside known splice junctions in
the second pass remapping. We confirmed effective ribosomal RNA (rRNA)
depletion per sample by aligning all paired-end reads to the complete
sequences for nuclear and mitochondrial rRNAs using BWA-mem
v0.7.17^[262]97 and ensured that the proportion of aligned reads did
not suggest excessive rRNA contamination. Duplicate reads were marked
using PicardTools v4.2.4.0 and post-alignment sequencing quality was
assessed using RNA-SeQC v1.1.8 to ensure adequate library
complexity^[263]98,[264]99. RNA sample identity was confirmed by
comparing single nucleotide variant (SNV) calls from RNA sequencing to
SNV calls from corresponding exome or genome sequencing data per
sample.
Intron retention outlier analysis
Fifty-three tissue-matched control samples from UDN participants
unrelated to the proband, mother and father were selected for outlier
analysis. IRFinder v1.2.456 was run in BAM mode using the same human
reference GRCh37 and Gencode v19 annotations on aligned BAM files to
measure the level of intron retention (i.e., “IRratio'') in MED11
across the proband, mother, father, and control samples. The IRratio is
computed per sample as (median read depth of first intron) / (number of
reads spanning first and second exons + median read depth of first
intron). Aligned reads covering the MED11 gene region (i.e.,
chr17:4,634,723-4,636,903) from the proband, mother, father and two
control samples were viewed using a local installation of the
Integrative Genomics Viewer (IGV) v2.16.0^[265]100.
Pathway enrichment analysis
Phenotypically-similar patient groupings
Phrank (version v2018-12-13) was used to compute all-against-all
pairwise phenotype similarity scores between all affected patients'
sets of standardized HPO terms. We normalized these scores by dividing
by the maximum self-similarity score in each pair^[266]101. UDN
patients experience a spectrum of symptoms across overlapping
biological categories and therefore cannot be easily separated into
distinct, well-defined clusters (Supplementary Fig. [267]17). We
iteratively grouped similar patient pairs using complete-linkage
hierarchical clustering with the agnes function from R's cluster
package (version v2.1.4), which allows for patient groups of different
sizes while minimizing the maximum distance between any two patients in
the same cluster. We assigned patients to clusters by cutting the
resulting dendrogram at height=3.5, resulting in 120 clusters of 2--97
patients per cluster (mean=22, median=17).
Selecting genes per patient cluster
We identify genes per patient cluster as follows. First, we consider
known diagnoses for all UDN patients in that cluster. For patients with
a diagnosis that was “complete'' (i.e., explained all symptoms
including asserted phenotypes), no further genes are considered. For
patients with no diagnosis or at most one “partial'' diagnosis, we then
consider genes with an exonic (or intronic with a SpliceAI score >0.4)
de novo variant and assign each gene its variant's severity weight (s)
from our modified DeNovoWEST procedure. Recall that weights are
assigned per variant class based on functional impact (e.g.,
frameshift, nonsense, missense), variant deleteriousness, and gene
constraint. Autosomal de novos are considered as before in addition to
de novos on chromosome X with gnomAD population maximum allele
frequency < 0.0001, TOPMed allele frequency <0.0001, proband alternate
allele read depth >20 and frequency >0.2 (for females) and alternate
read depth >20 in both parents. We consider the most significant gene
per patient with
[MATH:
Pr(s)
mrow><0.0005 :MATH]
, where
[MATH:
Pr(s)
mrow>=Pr(S
≥s∣K=1)Pr(K=
1) :MATH]
.
[MATH:
Pr(S≥s∣K=1)
:MATH]
is computed exactly using precomputed per-variant Roulette mutation
rates and variant weights per gene. Pr(K=1) is computed assuming
mutations follow a Poisson distribution with
[MATH:
λ=μge<
/mi>ne :MATH]
for genes falling on chromosome X in males and
[MATH: λ=
2μgen
e :MATH]
otherwise. Finally, in patients who still have fewer than two genes at
this point and no complete diagnoses, we include up to one additional
gene harboring a compound heterozygous variant pair that ranked in the
top 100 in our RaMeDiES-IND cohort-wide per-individual analysis, as
there was significant enrichment for correct diagnoses in this set
(Fisher's exact test P = 5.23e-4). Across all patient clusters, we
considered 70 genes (6.15%) with de novo variants corresponding to
known diagnoses, 10 genes (0.88%) with compound heterozygous variants
corresponding to known diagnoses, 562 (49.38%) other known diagnostic
genes, 434 genes (38.14%) with new de novo candidates, and 62 genes
(5.45%) with new compound heterozygous candidates.
Gene set enrichment analysis (GSEA)
The genes found across all patients in each patient cluster were used
as a query set for gene set enrichment analysis (GSEA) using g:Profiler
(version e108_eg55_p17)^[268]102. We considered Reactome and KEGG
biological pathway gene sets of size <150 genes and set our background
gene set to all human genes annotated in Ensembl. Enrichment P values
are adjusted using g:Profiler’s g:SCS approach^[269]103. Briefly, for
every query gene set size, 2000 random gene sets of the same size are
used as queries for GSEA with the same parameters, and the lowest
pathway enrichment P value is recorded for each random query set. A
threshold t is selected for each query gene set size as the 5% quantile
of these random minimum P values. Enrichment P values resulting from
the true gene query are then adjusted by multiplying by 0.05/t.
Reporting summary
Further information on research design is available in the [270]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[271]Supplementary Information^ (8.3MB, pdf)
[272]41467_2025_61712_MOESM2_ESM.pdf^ (126.6KB, pdf)
Description of Additional Supplementary Files
[273]Supplementary Data 1-9^ (3.7MB, xlsx)
[274]Reporting Summary^ (3.4MB, pdf)
[275]Transparent Peer Review file^ (25.5KB, pdf)
Acknowledgements