Abstract
Objective
Individuals with neurodevelopmental disorders such as global
developmental delay (GDD) present both genotypic and phenotypic
heterogeneity. This diversity has hampered developing of targeted
interventions given the relative rarity of each individual genetic
etiology. Novel approaches to clinical trials where distinct, but
related diseases can be treated by a common drug, known as basket
trials, which have shown benefits in oncology but have yet to be used
in GDD. Nonetheless, it remains unclear how individuals with GDD could
be clustered. Here, we assess two different approaches: agglomerative
and divisive clustering.
Methods
Using the largest cohort of individuals with GDD, which is the
Deciphering Developmental Disorders (DDD), characterized using a
systematic approach, we extracted genotypic and phenotypic information
from 6,588 individuals with GDD. We then used a k-means clustering
(divisive) and hierarchical agglomerative clustering (HAC) to identify
subgroups of individuals. Next, we extracted gene network and molecular
function information with regard to the clusters identified by each
approach.
Results
HAC based on phenotypes identified in individuals with GDD revealed 16
clusters, each presenting with one dominant phenotype displayed by most
individuals in the cluster, along with other minor phenotypes. Among
the most common phenotypes reported were delayed speech, absent speech,
and seizure. Interestingly, each phenotypic cluster molecularly
included several (3–12) gene sub-networks of more closely related genes
with diverse molecular function. k-means clustering also segregated
individuals harboring those phenotypes, but the genetic pathways
identified were different from the ones identified from HAC.
Conclusion
Our study illustrates how divisive (k-means) and agglomerative
clustering can be used in order to group individuals with GDD for
future basket trials. Moreover, the result of our analysis suggests
that phenotypic clusters should be subdivided into molecular
sub-networks for an increased likelihood of successful treatment.
Finally, a combination of both agglomerative and divisive clustering
may be required for developing of a comprehensive treatment.
Keywords: neurodevelopmental differences, global developmental delay,
phenotype, whole exome sequencing, hierarchical agglomerative
clustering, k-means clustering, basket trials
1. Introduction
Neurodevelopmental disorders/difference (NDDs) are a broad group of
disabilities characterized by impairments of personal, social,
academic, or occupational functioning and/or skill development ([42]1).
NDDs affect approximately 18% of the population ([43]2–[44]8) and have
a significant impact on the individual, their family, and society
([45]9, [46]10). Global developmental delay (GDD) is a subtype of NDD
and has a prevalence rate of approximately 3% ([47]11, [48]12). GDD is
diagnosed when an individual under the age of 5 years fails to meet the
expected developmental milestones in two or more domains of
development, such as language, gross or fine motor skills, or social
functioning ([49]13).
Individuals with GDD, as in most NDD, present within a spectrum of
severity ([50]14). Moreover, several genes have now been linked to GDD
([51]15), but each gene individually affects a relatively small number
of individuals, making them known as rare disorders in most cases. The
phenotypic complexity, combined with the genetic heterogeneity and
rarity, has hampered our ability to translate our understanding of the
molecular underpinning of GDD into targeted interventions that are
clinically approved and mechanism-based. Most trials conducted
previously have been focused on translating candidate drugs identified
from pre-clinical investigations on a given gene into individuals with
mutation in that gene. Unfortunately, this approach has been
accompanied by challenges in recruitment, clinical diversity, and a
high number of genes to target.
Fortunately, a novel approach to clinical trials is emerging, known as
basket trials ([52]16), which aims at testing candidate drugs in a
group of disorders related by shared pathophysiology, consequently
improving the cost-effectiveness of the trial. So far, this approach
has been proven to be very productive in oncology, where participants
with different diagnoses but share a common underlying dysregulated
molecular pathway are treated with the same therapeutics ([53]17). In
GDD, individuals could be subgrouped based on their phenotypic or
genotypic profiles ([54]18). Therefore, it is important to gain a
better understanding of how GDD individuals could be clustered.
In general, two approaches have been used in clustering: agglomerative
(also referred to as bottom-up) or divisive (referred to as top-down)
([55]19, [56]20). Hierarchical agglomerative clustering (HAC) aims at
identifying homogeneous groups of individuals based on their phenotypic
profiles, which does not assume a given number of cluster and therefore
can lead to a combination of phenotypes ([57]21, [58]22). HAC has been
widely used due to its ability to detect the natural number of clusters
in a dataset ([59]23–[60]25). On the other hand, a divisive approach
such as k-means clustering ([61]19) requires a set number of clusters
to be established and then assigns individuals to each cluster based on
their similarity ([62]26–[63]28).
Here, we show how these two approaches can be applied to clustering
individuals in the largest cohort of individuals with GDD: the
Deciphering Developmental Disorders (DDD) cohort ([64]29).
2. Materials and methods
2.1. Cohort description
The data included in the DDD dataset were acquired from 24 clinical
genetics centers in the United Kingdom National Health Service and the
Republic of Ireland. A total of 13,462 individuals with undiagnosed
developmental disorders were included in this study. After obtaining
ethics approval at our centers, and with permission from the DDD
consortium, we analyzed the dataset for phenotypes of the individuals.
In DDD, the human phenotype ontology (HPO) is used to describe the
phenotypic information of the individuals ([65]30). HPO contains over
15,000 terms, which describes phenotypic abnormalities and allows the
use of standardized and controlled vocabulary for listing phenotypes
([66]31). We divided the HPO identified in individuals with GDD between
structural (dysmorphic features or congenital malformation) and
functional (affecting behavior or clinical symptoms). We focused this
study on functional HPO, considering our goal to cluster patients for
future interventions.
2.2. Genomic sequence analysis
The exome sequence data of GDD-phenotyped individuals were analyzed in
two stages. In the first stage, the existing GRCh37/hg19 exome sequence
was realigned to the GRCh38 genome reference sequence. Then, short
variant [single nucleotide variants (SNVs) and indels] calling was
performed using the GATK best practices ([67]32), which involves
realigning reads to the GRCh38 reference genome, variant calling using
the HaplotypeCaller and joint genotyping. Finally, variant quality
recalibration and refinement steps were performed, leading to high
quality variant callset. In the second stage, these variants were
annotated for gene information (Ensembl), frequencies (from gnomAD,
ExAC, and internal cohort GDD), and pathogenicity (from CADD, ClinVar,
and ClinGen). The annotated set of variants in the callset were
filtered for gene information, rare variants having minor allele
frequency (MAF) value of ≤1%, impact on the transcript, and
pathogenicity. The details of the annotation and filtering criteria can
be found in Section 1.1 in the [68]Supplementary Material.
2.3. Candidate gene list
We searched PubMed using the keywords intellectual disability (ID) and
global developmental delay (GDD), and compiled a list of genes from
original research and review papers ([69]33–[70]36). We also used genes
listed in databases related to NDDs for diseases and phenotypes, and
even included the genome-wide association studies (GWAS) using the same
keywords [SysID ([71]37, [72]38), DisGenet ([73]39, [74]40), HPO
([75]31, [76]41), OMIM ([77]42, [78]43), Orphanet ([79]44), Phenolyzer
([80]45, [81]46), Ingenuity Pathway Analysis (Qiagen), Open Targets
([82]47, [83]48), AutDB ([84]49)]. We have added the Intellectual
Disability NGS Radboudumc and Fulgent gene panels to achieve the most
complete overview. Each gene list was obtained separately, and then
only those genes that appeared at least three times in the collected
data were retained, resulting in a final list of 2,537 genes (see
[85]Supplementary Table 1).
2.4. Clustering strategies
2.4.1. Hierarchical agglomerative clustering of phenotypes
Among all the HPO-based phenotypes identified in individuals with GDD,
we considered the functional phenotypes (as opposed to morphological
features) for this cluster analysis. Since all the phenotypes can be
treated as binary features and the dissimilarity between two
individuals can be calculated based on their shared and distinct
phenotypes, Jaccard distance was used ([86]50) to measure the
dissimilarity between the individuals, which is calculated as follows:
[MATH: D(Ii,Ij)=1−|Pi∩Pj||Pi∪Pj| :MATH]
where
[MATH: D(Ii,Ij)
:MATH]
is the distance between individual
[MATH: Ii :MATH]
and
[MATH: Ij :MATH]
with set of phenotypes
[MATH: Pi :MATH]
and
[MATH: Pj :MATH]
, respectively. Once the distance matrix representing the Jaccard
dissimilarity between all the individuals is obtained, hierarchical
clustering with ward linkage method is applied. To assess the cluster
validity, we selected the silhouette index (SIL) due to its ability to
provide assessment at the cluster and individual level. SIL is an
internal measure that considers both the intra-cluster and
inter-cluster distances to provide the estimate of compactness and
separateness of the clusters ([87]51). For each data point
[MATH: Ii :MATH]
, the distance to all data points belonging to the same cluster is
calculated, and the average of these distances is referred as
[MATH: Ij :MATH]
. Then, the distance to all data points belonging to other clusters is
calculated, and the smallest of them is referred as
[MATH: bi :MATH]
. The silhouette coefficient for each individual is calculated as
follows:
[MATH:
SILi=(bi−ai)max(ai,bi) :MATH]
Overall clustering results can be assessed by taking the average
silhouette index of all the individual points, and its value ranges
from worst value −1 to best value 1.
2.4.2. Dependency model for genes/phenotype clustering
We performed a divisive clustering approach with k-means using a
dependency model to cluster the genes associated to phenotypes. The
dependency model ([88]Supplementary Figure S1) was created under a
probabilistic framework where the relationship between a given set of
phenotypes (H) and genes (G) is inferred for the affected individuals
(I). This dependency model captures the direct relation among
individuals having distinct GDD phenotypes (solid red line in
[89]Supplementary Figure S1). Similarly, the same affected individuals
carry those rare genetic variants among a set of genes. However, it is
difficult to establish whether these genes cause the individuals to
acquire GDD or because of inheriting these GDD-related traits, the
individuals acquired mutations in these genes. The direction of
causality cannot be established between phenotype and genes; hence,
these can only be inferred probabilistically (represented by a broken
red line in [90]Supplementary Figure S1).
The underlying dependency model facilitates a framework to identify
clusters of genes with respect to a given set of phenotypes.
Mathematically, this can be represented as computing probability
distribution of genes, which is conditional to the phenotypes given by
P(G|H):
[MATH: P(G|H)=∑j=1NP(Ij)P(Ij)P(Ij)∑j=1NP(Ij)P(Ij) :MATH]
Since our task is to compare probability distribution of genes with
respect to a given set of phenotypes, the denominator in the above
equation can be ignored, and it can be simplified as follows:
[MATH: P(G|H)∞∑j=1NP(Ij)P(Ij)P(Ij) :MATH]
This eventually yields a matrix of probability distribution where rows
represent the genes and columns represent the phenotypes. A k-means
clustering was then applied to a subset of the probability matrix,
including the phenotypes of interest ([91]52). The elbow and silhouette
methods were used to determine the optimal number of clusters ([92]53).
To highlight significant differences between phenotypes for each gene
group, a t-test was applied to the clustering results. We corrected the
results obtained using the Bonferroni correction to adjust the p-values
for multiple comparisons. Adjusted p-values less than or equal to 0.05
were considered statistically significant.
2.5. Protein–protein interaction and pathway enrichment analysis
The resulting gene clusters were then analyzed by protein–protein
interaction prediction, clustering, and pathway enrichment using STRING
v11.5 ([93]54). The Markov clustering algorithm (MCL) was used to
identify the gene sub-networks ([94]55). The inflation parameter of MCL
was set to 1.5 ([95]56). The functional enrichment analysis of each
module was also performed in the STRING v11.5 database using the GO
terms and REACTOME pathway. The false discovery rate obtained from the
functional enrichment analysis describes the degree of significance of
the enrichment. The p-values are corrected for multiple testing in each
category using the Benjamini–Hochberg procedure. Visualizations of
protein–protein interactions and clusters were also obtained using
STRING v11.5.
3. Results
3.1. Individuals with GDD present with phenotypic diversity
Among all participants with NDD in the DDD cohort, the individuals with
GDD represent 49.13% (6,588 probands) of the population. We identified
761 physiological phenotypes and 2,158 morphological phenotypes
occurring in this cohort. The phenotypes that are most commonly
represented (>1%) affected multiple systems (neurological,
gastrointestinal, locomotor) in individuals with GDD ([96]Figure 1,
[97]Table 1). The physiological phenotypes that are
neurodevelopmental-related comprised autistic behavior (6.25%), autism
(4.57%), absent speech (AS) (4.36%), aggressive behavior (3.25%), and
stereotypy (4.42%) (including 2.34% of the global term, 2.06% recurrent
hand flapping, 0.18% repetitive compulsive behavior, and 0.17% tongue
thrusting). For other physiological issues, seizures were reported in
21.71% of individuals with GDD ([98]Supplementary Table S2), followed
by hypotonia (15.12%), strabismus (8.76%), joint hypermobility (6.60%),
and sleep disturbance (4.36%). The most frequent systemic phenotypes
involved constipation (5.6%) and gastroesophageal reflux (5.12%).
Figure 1.
[99]Figure 1
[100]Open in a new tab
Most prevalent physiological phenotypes in individuals with GDD using
HPO ontology (frequency >1%). Multiple systems are affected including
neurological, respiratory, gastrointestinal, and locomotor systems.
Created with [101]BioRender.com.
Table 1.
Most prevalent (>1%) physiological phenotypes in individuals with GDD.
Physiological phenotype No. of all GDD % of all GDD
Neurodevelopmental
Delayed speech and language development 1,037 15.74
Autistic behavior 412 6.25
Autism 301 4.57
Absent speech 287 4.36
Specific learning disability 173 2.63
Cognitive impairment 121 1.84
Delayed gross motor development 113 1.72
Attention deficit hyperactivity disorder 106 1.61
Developmental regression 105 1.59
Motor delay 91 1.38
Intellectual disability, moderate 74 1.12
Intellectual disability 69 1.05
Intellectual disability, severe 68 1.03
Other neurological
Seizure 796 12.08
Sleep disturbance 287 4.36
Aggressive behavior 214 3.25
Generalized-onset seizure 178 2.70
Drooling 167 2.53
Stereotypy 154 2.34
Behavioral abnormality 145 2.20
Recurrent hand flapping 136 2.06
Generalized non-motor (absence) seizure 121 1.84
Central hypotonia 113 1.72
Bilateral tonic-clonic seizure 108 1.64
Gait ataxia 107 1.62
Epileptic spasm 93 1.41
Broad-based gait 87 1.32
Febrile seizure (within the age range of 3 months to 6 years) 81 1.23
Generalized myoclonic seizure 75 1.14
Impaired social interactions 73 1.11
Abnormal aggressive, impulsive, or violent behavior 69 1.05
Ataxia 69 1.05
Short attention span 69 1.05
Dysphagia 67 1.02
Eyes
Strabismus 577 8.76
Hypermetropia 205 3.11
Nystagmus 124 1.88
Myopia 121 1.84
Bilateral ptosis 94 1.43
Astigmatism 82 1.24
Visual impairment 78 1.18
Cerebral visual impairment 66 1.00
Ears
Sensorineural hearing impairment 88 1.34
Hearing impairment 87 1.32
Conductive hearing impairment 75 1.14
Cardiovascular system
Pulmonic stenosis 66 1.00
Respiratory system
Recurrent respiratory infections 72 1.09
Digestive system
Constipation 369 5.60
Gastroesophageal reflux 337 5.12
Feeding difficulties in infancy 174 2.64
Feeding difficulties 122 1.85
Gastrostomy tube feeding in infancy 115 1.75
Other
Joint hypermobility 435 6.60
Generalized hypotonia 371 5.63
Muscular hypotonia 281 4.27
Joint laxity 135 2.05
Eczema 123 1.87
Neonatal hypotonia 75 1.14
[102]Open in a new tab
We present the number of individuals with GDD from the DDD dataset
presenting with the phenotypes listed. The phenotypes were structured
as either neurodevelopmental, other neurological, or by target organs
outside the brain.
We also delineated the morphological features observed most frequently
in individuals with GDD ([103]Supplementary Table S3). With regard to
growth parameters, microcephaly was found in 19.74% of individuals.
Macrocephaly was much less common in GDD (5.53%). Short stature was
also frequently reported (5.83%). Cardiac malformations were the most
commonly encountered malformation with a prevalence rate of 7.14%.
Scoliosis was present in 2.58% of individuals.
3.2. Genotypic diversity in individuals with GDD
We identified likely pathogenic and pathogenic variants in the
individuals with GDD using ClinVar ([104]57). From our list of 2,537
candidate causal genes, we found that 1,416 genes were affected by
pathogenic or likely pathogenic variants in individuals with GDD in the
DDD cohort. We found that most genes contained pathogenic variants in
only a small number of individuals, suggesting an important genotypic
diversity ([105]Figure 2, [106]Supplementary Table S1). Indeed, only 86
genes were found in more than 1% of individuals. We also looked at how
many genes from our list were affected by pathogenic/likely pathogenic
variants per individual in the GDD cohort. A total of 3.5% of the GDD
cohort has no pathogenic/likely pathogenic variants from our gene list,
and more than 20% of individuals have three mutated genes
([107]Figure 2).
Figure 2.
[108]Figure 2
[109]Open in a new tab
Analysis of the overall cohort of individuals with GDD from DDD reveals
the genotypic complexity of GDD. (A) Individual GDD/ID candidate genes
are rare, as observed clinically. Using the DDD cohort allows us to
quantify this with only one to five individuals in a cohort of 6,588
individuals with GDD. (B) Individuals with GDD present most of the time
with multiple pathogenic variants in distinct genes. The dotted line
refers to the median at three genes per individual.
3.3. Comparison of clustering approaches
Next, we compared two approaches of clustering: (1) hierarchical
agglomerative where clustering of individuals was based on phenotypes,
and (2) where individuals with and without a phenotype are compared for
their genetic mutations. For each approach, we identified the genes
(from our curated list of GDD/ID genes) harboring likely pathogenic or
pathogenic variants (ClinVar).
* (1)
A hierarchical agglomerative clustering of all individuals with GDD
resulted in 16 distinct clusters (as visualized in the dendrogram
in [110]Figure 3A). The silhouette index of all the individuals is
displayed in [111]Figure 3B. In order to validate the clustering
results, we assessed each cluster separately. Except for cluster
16, all other clusters had a positive silhouette index for majority
of the individuals, indicating that the individuals are clustered
in the correct group. However, in each cluster, some individuals
had a negative silhouette index. The variability in the number of
phenotypes per individual and fewer shared phenotypes among
individuals could possibly lead to the distortion in the coherence
of the clusters (detailed analysis in [112]Supplementary Figure
S2). While we observed more than two phenotypes per cluster, most
individuals presented with GDD + 1 dominant phenotype. Cluster 16,
containing 3,077 individuals, showed no dominant phenotype apart
from the GDD.
We focused on clusters related to speech since 20% of individuals
from the GDD cohort presented with delayed speech to various
degrees. Milder defects were categorized as delayed speech, which
was present in 15.74% of individuals with GDD, and more severe
defects were categorized as absent speech, present in 4.36% of
individuals. HAC Cluster 4 was characterized by delayed speech and
language development (DSLD), which presented in 329 out of 376
individuals in this cluster, while Cluster 14 defined individuals
with absent speech, which presented in134 out of 143 individuals in
this cluster ([113]Figures 4A, [114]5A). Cluster 4 comprised 376
individuals with likely pathogenic or pathogenic variants in 483
candidate genes, while Cluster 14 comprised 143 individuals with
likely pathogenic or pathogenic variants in 303 candidate genes. We
then identified a gene network for each cluster ([115]Figures 4B,
[116]5B). We observed that in both networks, there were
sub-networks (of more closely related genes with different
significantly enriched molecular functions). For DSLD, we
identified 27 sub-networks, with 12 of these sub-networks
containing more than 10 genes with enrichment in the following
pathways: sub-network 1: DNA repair, cell cycle, transcription,
meiosis, and regulation of TP53 activity; sub-network 2: cilium
assembly, visual photo transmission, cargo trafficking, and
Hedgehog signaling; and sub-network 3: cation channel complex,
channel activity, and plasma membrane complex (see
[117]Supplementary Table S4 for all sub-networks). For AS, we
identified 23 subclusters, with eight of these containing more than
10 genes with enrichment in the following pathways: sub-network 1:
DNA repair, cell cycle, and transcriptional regulation; sub-network
2: cilium assembly, anchoring to the basal membrane; and
sub-network 3: interaction between LI and ankyrin, L1CAM (see
[118]Supplementary Table S5 for all sub-networks). This overlap in
gene function enrichment between DSLD and AS was also associated
with 31.7% overlap in genes identified by HAC for each condition
([119]Supplementary Figure S3A). Interestingly, those genes showed
enrichment in more specific pathways associated previously to
cognition ([120]58), such as signal transduction and secondary
messenger, transcription, and chromatin modification
([121]Supplementary Table S6).
* (2)
Next, using the dependency probabilistic models, we determined that
three clusters would be optimal for k-means clustering to see if we
could identify subgroups of individuals with speech defects. First,
we noted that k-means clustering separated individuals with DSLD
and AS with no gene overlap ([122]Supplementary Figure S3B). As in
HAC, k-means clusters subdivided at the gene network level
([123]Figure 6). DSLD had three sub-networks containing more than
10 genes, and there were 24 sub-networks in total; the enrichment
was based on chromatin-modifying enzyme, signal transduction, NOTCH
signaling (sub-network 1); visual phototransduction, recruitment of
mitotic centrosomes, noradrenaline inhibition of insulin, calcium
pathway (sub-network 2); and protein interaction at the synapse,
neurexin and neuregulin, trafficking of GluR2-containing AMPA
receptors (sub-network 3) (see [124]Supplementary Table S7 for all
sub-networks). In AS, only the sub-network 1 had more than 10
genes: oxygen binding (subcluster 1), ribonucleoside synthetic
process (subcluster 2), and amino acid binding (subcluster 3) (see
[125]Supplementary Table S8).
Figure 3.
[126]Figure 3
[127]Open in a new tab
Hierarchical agglomerative clustering using phenotypic profiles of
individuals with GDD. (A) Dendrogram presenting 16 clusters of
individuals. (B) Silhouette plot for all the individuals for cluster
validity analysis.
Figure 4.
[128]Figure 4
[129]Open in a new tab
Gene network from individuals with GDD + DSLD using hierarchical
agglomerative clustering (Cluster 4). (A) Cluster level analysis of the
hierarchical clustering results. (a) Top five phenotypes among the
individuals. (b) Silhouette score for all the individuals and the
average. (c) Distribution of number of phenotypes per individual. (d)
Distribution of shared number of phenotypes among all individual pairs.
For Cluster 4, DSLD is the dominant phenotype. With some individuals
having negative silhouette index, overall cluster level index is
positive, indicating that most of the individuals are in the right
cluster. Similar to Cluster 2, the majority of the individuals have two
phenotypes, but the number of phenotypes per individual ranges up to
14, indicating the phenotypic variability among individuals. (B)
Overall representation. Each color corresponds to a representative
sub-network of more closely related genes.
Figure 5.
[130]Figure 5
[131]Open in a new tab
Gene network for individuals with AS identified by hierarchical
agglomerative clustering (Cluster #14). (A) (a) Top five phenotypes
among the individuals. (b) Silhouette score for all the individuals and
the average. (c) Distribution of number of phenotypes per individual.
(d) Distribution of shared number of phenotypes among all individual
pairs. For Cluster 14, absent speech is the dominating phenotype, and
the average silhouette index for the cluster is 0.058, which is
slightly above zero. Almost half of the individuals have a negative SIL
index, which could possibly be due to the high variability in the
number of phenotypes per individual. As shown in the third plot for
Cluster 14, the number of phenotypes per individual ranges up to 15,
and the majority of the individuals share only two phenotypes, leading
to less similarity among individuals. (B) Overall representation. Each
color corresponds to a representative sub-network of more closely
related genes.
Figure 6.
[132]Figure 6
[133]Open in a new tab
Gene network for individuals with DSLD or AS identified from k-means
clustering. (A) k-means clustering of individuals without speech
disorder, delayed speech, or absent speech. (B) Overall gene network
for individuals with delayed speech. (C) Overall gene network for AS.
Each color corresponds to a representative sub-network of more closely
related genes.
Next, we assessed if genes that were identified for DSLD and AS
overlapped between HAC and k-means clusters. Surprisingly, only 12.7%
and 8.4% of genes overlapped for DSLD and AS between the two clustering
approaches, respectively ([134]Supplementary Figure S4).
In order to assess if these differences in results between clustering
approaches were phenotype-specific, we verified if the phenotype
seizure, which is commonly associated with GDD, would provide similar
results. In HAC, Cluster #2 is characterized by GDD + seizure and
pathogenic/likely pathogenic variants in 510 candidate genes
([135]Figure 7A). Again, at the gene network level, we observed
fragmentation into sub-networks ([136]Figure 7B, [137]Supplementary
Table S9). The MCL clustering found 30 sub-networks, and 13 of them
contained more than 10 genes. Each sub-network involved distinct
pathways: neuronal system, L1 and ankyrin interaction, axon guidance,
synaptic transmission (sub-network 1); transcription regulation, cell
cycle, and DNA repair (subcluster 2); and sensory function, cilium,
vision, and RNA polymerase (subcluster 3) (see [138]Supplementary Table
S9 for all sub-networks).
Figure 7.
[139]Figure 7
[140]Open in a new tab
Genetic makeup and clustering for individuals with GDD and seizure
(Cluster #2). (A) (a) Top five phenotypes among the individuals. (b)
Silhouette score for all the individuals and the average. (c)
Distribution of number of phenotypes per individual. (d) Distribution
of shared number of phenotypes among all individual pairs. For
instance, for Cluster 2 in row 2, the first plot shows that seizure is
the most dominant phenotype; the second plot shows the silhouette index
of all individuals, which is positive for the majority of the
individuals, indicating that the individuals are grouped in the right
cluster. The third plot in row 2 shows that the majority of the
individuals in Cluster 2 has two phenotypes but also ranges up to 10
phenotypes for some of the individuals. The fourth plot shows that most
of the individuals share two phenotypes in this cluster. (B) Overall
gene network for all individuals included in the agglomerative cluster.
The optimal number of k-means clustering for individuals with GDD with
or without seizure was set at two so that only one gene cluster was
identified as being related to seizures ([141]Figure 8A). The gene
network can also be divided into subgroups, and MCL clustering has
highlighted 14 subclusters with only two subclusters having more than
10 genes ([142]Figure 8B). The molecular functions for each cluster
were as follows: subcluster 1: organic and hetero-cyclic compound
binding and transcription regulator activity; subcluster 2: synaptic
function and ion channel and transporter activity; and subcluster 3:
extracellular matrix and glycoprotein (see [143]Supplementary Table S10
for all sub-networks).
Figure 8.
[144]Figure 8
[145]Open in a new tab
Identification of individuals with/without seizure based on
probabilistic genotype–phenotype clustering. (A) Clustering of genes
based on the association with seizure (on the left are genes not
encountered in individuals with GDD and seizure, while on the right are
genes associated with GDD and seizure. Higher probability is marked in
red). (B) Genetic network showing clustering of genes found in
individuals with seizures, revealing the presence of sub-networks.
We then compared the overlap between the two clustering approaches. We
found that 62 genes (11%) overlapped between HAC and k-means
([146]Supplementary Figure S5). Importantly, this degree of overlap was
similar to the overlap identified between the two clustering approaches
when assessing DSLD and AS phenotypes.
4. Discussion
Global developmental delay is a clinical entity encountered commonly in
both general and specialized medical practice, affecting the lives of
approximately 3% of the pediatric population or almost 60 million
children worldwide ([147]59). While some individuals with syndromes
such as Down, Fragile X, Rett, and Angelman have been diagnosed via
targeted gene testing, most individuals remained undiagnosed until the
development of genome-wide tools such as chromosomal microarray (CMA)
and whole exome sequencing (WES) ([148]15). Together, these genome-wide
tools have raised the diagnostic yield in GDD to approximately 60%
([149]15), and have identified approximately 2,000 genes replicated in
at least three studies ([150]Supplementary Table S1). This genetic
complexity has led to challenging targeted interventions. Novel
approaches such as basket trials, where one drug is used to treat
distinct but related conditions, have shown the promises of precision
medicine ([151]15–[152]17) for those with GDD, but the best approach to
clustering individuals remains unknown.
We leveraged the largest dataset of individuals with GDD, which is the
DDD ([153]29, [154]30), and tested various clustering approaches. By
including 6,588 individuals with GDD, we were able to identify clear
clusters based on phenotypic proximity. We considered the possibility
of combining similar or closely related categories into a single main
category. However, we decided to maintain the specificity of each HPO
term in order to avoid bias in the analyses. Individuals were annotated
by their clinician, and we wanted to take advantage of the opportunity
to accurately capture the various manifestations of GDD. This accurate
phenotypic characterization has enhanced the robustness of our
clustering analysis and has facilitated the identification of relevant
gene clusters. In our study, we also wanted to use a pragmatic
approach, reproducing what would be done when planning a
pharmacological intervention in individuals with GDD, so that
individuals would be included based on their GDD genetic diagnosis, and
not based on a phenotype-specific genotype; hence, the focus on their
GDD gene diagnosis.
Since there is an interest (and need due to their high number of genes
and low individual prevalence) in combining individuals with GDD into a
“basket” targeted by the same drug (one drug-multiple targets), we
wanted to assess how this could be achieved. We clustered based on
phenotype, following two of the most common approach: divisive based on
presence or absence of a phenotype or agglomerative, which does not
assume a given number of clusters and therefore can lead to a
combination of phenotypes. When comparing clustering approaches, we
found that hierarchical agglomerative clustering, an approach where
individuals sharing features are grouped together, could identify
bigger clusters of genes, but was less precise in segregating genes
between related conditions (for instance, delayed speech and language
development compared with absent speech). On the other hand, k-means
clustering provided more distinct groups of genes but identified fewer
genes per phenotype. Importantly, the two methods showed overlap in
approximately 10% of the genes they identified. There is a limited
overlap in the clusters identified between the two methods. We
postulate that this relates to the difference in approach (divisive vs.
agglomerative). In the divisive approach, the groups are divided based
on the presence or absence of a specific phenotype. We think that this
approach leads to a smaller but more stringent set of causative genes.
The agglomerative approach identifies phenotypic clusters with a
predominant phenotype (but with the inclusion of other phenotypes as
well. This may lead to the identification of a different genetic makeup
(genes with pleiotropic effect for instance). We believe that our work
will point out that the method used to identify individuals in a future
treatment trial using a basket trial approach should consider how
participants are grouped. Also, for both approaches of clustering, our
results have shown consistently that the genetic makeup of a relatively
homogenous phenotypic cluster is constituted of multiple subclusters.
So for a given basket, it may be possible to understand the response
based on that molecular information. This also shows the importance of
performing such genetic characterization. It is interesting to observe
that while the pathways found in each approach are overall similar, the
number of individuals and their individual genes in each cluster are
somehow different, which is probably due to the “phenotypic purity” of
the cluster.
An important observation in considering future basket trials was that
individuals with GDD harboring the same phenotypes could be further
subdivided based on genomic information into gene network clusters.
This is important due to the reason that individuals within a given
basket for a trial may need to be guided by genetic information and
assigned to different treatment regimen.
These findings also extended to other common comorbidities such as
seizure. Seizures were found to be present in 12.08% of the individuals
with GDD. This finding is similar to what has been reported in autism
spectrum disorder (ASD) ([155]60), but lower than the 56% prevalence
rate of epilepsy in a GDD cohort that was reported recently ([156]61).
Importantly, we observed a similar behavior of both clustering
approaches: HAC identified more genes than the k-means, and that only
approximately 12% of genes overlapped between methods.
It is also important to note that we have made two major filters in
this study: first, filter on a list of candidate genes for cognitive
neuro difference (GDD and ID), and second to select only the variants
in these genes annotated as pathogenic/likely pathogenic in ClinVar.
Therefore, it was expected that the enrichment found for each cluster,
especially for those found with the HAC method (which shows more genes
than the k-means method), would be related to the properties of the
selected set and not necessarily cluster specific. In contrast, the
number of genes enriched in particular molecular functions changes
between clusters; for example, in Cluster 4 (DSLD), DNA binding and
transcription stand out in the first subclusters of the 1,417 ID + GDD
gene set, but the channel activity stands out far in the subclusters of
the 1,417 gene set. This may suggest that ion channels are more related
to DSLD.
Overall, our study provides a rationale for the possibility of having
success with basket trials in the future for drug development in GDD by
showing how large groups of individuals with GDD could be separated
into closely related subgroups. It also shows how different clustering
approaches will influence the size and nature of the cluster.
Furthermore, despite showing shared genetic function, each sub-network
(as opposed to the phenotypic cluster) may need to be considered in
terms of druggability and potential side effects (for DNA or RNA
binding). This highlights the potential importance of genomic
sequencing in pharmaceutical trials. Our findings point to the fact
that it is important to correlate phenotype with not only a single gene
but also take into account the polygenic nature of each individual. It
will be important to understand that phenotypes may not be explainable
by considering a single gene correlation but rather a polygenic
approach and that future work (probably with large sample size) will be
required to assess the correlation between combination of pathogenic
variants and phenotypic presentation. Also, analysis of each mutation
present against the proposed treatment would be important in future
clinical trials.
It should be noted that the progress in clustering includes the
application of deep learning-based methods, which could potentially
complement our research into the genetic basis of GDD. Furthermore, our
future research will be based on the assessment of persons with ASD,
given the features that they share with GDD. By extending our
clustering methods to ASD, we could not only highlight common genetic
factors but also refine targeted interventions, broadening the impact
of our study beyond GDD. Aware of the role of intronic variants, it
will be necessary to integrate whole genome sequencing (WGS) to take
into account the whole genetic background of GDD.
Future in vivo study will be needed to validate which method is most
useful at finding successfully treated clusters. Indeed, it will be
important to use animal models this time and patient-derived cell lines
to validate the response to candidate treatment for genes belonging to
a given cluster. Together, this clustering has the potential to
accelerate access to targeted treatment for individuals with GDD.
Acknowledgments