Abstract

Objective

   Individuals with neurodevelopmental disorders such as global
   developmental delay (GDD) present both genotypic and phenotypic
   heterogeneity. This diversity has hampered developing of targeted
   interventions given the relative rarity of each individual genetic
   etiology. Novel approaches to clinical trials where distinct, but
   related diseases can be treated by a common drug, known as basket
   trials, which have shown benefits in oncology but have yet to be used
   in GDD. Nonetheless, it remains unclear how individuals with GDD could
   be clustered. Here, we assess two different approaches: agglomerative
   and divisive clustering.

Methods

   Using the largest cohort of individuals with GDD, which is the
   Deciphering Developmental Disorders (DDD), characterized using a
   systematic approach, we extracted genotypic and phenotypic information
   from 6,588 individuals with GDD. We then used a k-means clustering
   (divisive) and hierarchical agglomerative clustering (HAC) to identify
   subgroups of individuals. Next, we extracted gene network and molecular
   function information with regard to the clusters identified by each
   approach.

Results

   HAC based on phenotypes identified in individuals with GDD revealed 16
   clusters, each presenting with one dominant phenotype displayed by most
   individuals in the cluster, along with other minor phenotypes. Among
   the most common phenotypes reported were delayed speech, absent speech,
   and seizure. Interestingly, each phenotypic cluster molecularly
   included several (3–12) gene sub-networks of more closely related genes
   with diverse molecular function. k-means clustering also segregated
   individuals harboring those phenotypes, but the genetic pathways
   identified were different from the ones identified from HAC.

Conclusion

   Our study illustrates how divisive (k-means) and agglomerative
   clustering can be used in order to group individuals with GDD for
   future basket trials. Moreover, the result of our analysis suggests
   that phenotypic clusters should be subdivided into molecular
   sub-networks for an increased likelihood of successful treatment.
   Finally, a combination of both agglomerative and divisive clustering
   may be required for developing of a comprehensive treatment.

   Keywords: neurodevelopmental differences, global developmental delay,
   phenotype, whole exome sequencing, hierarchical agglomerative
   clustering, k-means clustering, basket trials

1. Introduction

   Neurodevelopmental disorders/difference (NDDs) are a broad group of
   disabilities characterized by impairments of personal, social,
   academic, or occupational functioning and/or skill development ([42]1).
   NDDs affect approximately 18% of the population ([43]2–[44]8) and have
   a significant impact on the individual, their family, and society
   ([45]9, [46]10). Global developmental delay (GDD) is a subtype of NDD
   and has a prevalence rate of approximately 3% ([47]11, [48]12). GDD is
   diagnosed when an individual under the age of 5 years fails to meet the
   expected developmental milestones in two or more domains of
   development, such as language, gross or fine motor skills, or social
   functioning ([49]13).

   Individuals with GDD, as in most NDD, present within a spectrum of
   severity ([50]14). Moreover, several genes have now been linked to GDD
   ([51]15), but each gene individually affects a relatively small number
   of individuals, making them known as rare disorders in most cases. The
   phenotypic complexity, combined with the genetic heterogeneity and
   rarity, has hampered our ability to translate our understanding of the
   molecular underpinning of GDD into targeted interventions that are
   clinically approved and mechanism-based. Most trials conducted
   previously have been focused on translating candidate drugs identified
   from pre-clinical investigations on a given gene into individuals with
   mutation in that gene. Unfortunately, this approach has been
   accompanied by challenges in recruitment, clinical diversity, and a
   high number of genes to target.

   Fortunately, a novel approach to clinical trials is emerging, known as
   basket trials ([52]16), which aims at testing candidate drugs in a
   group of disorders related by shared pathophysiology, consequently
   improving the cost-effectiveness of the trial. So far, this approach
   has been proven to be very productive in oncology, where participants
   with different diagnoses but share a common underlying dysregulated
   molecular pathway are treated with the same therapeutics ([53]17). In
   GDD, individuals could be subgrouped based on their phenotypic or
   genotypic profiles ([54]18). Therefore, it is important to gain a
   better understanding of how GDD individuals could be clustered.

   In general, two approaches have been used in clustering: agglomerative
   (also referred to as bottom-up) or divisive (referred to as top-down)
   ([55]19, [56]20). Hierarchical agglomerative clustering (HAC) aims at
   identifying homogeneous groups of individuals based on their phenotypic
   profiles, which does not assume a given number of cluster and therefore
   can lead to a combination of phenotypes ([57]21, [58]22). HAC has been
   widely used due to its ability to detect the natural number of clusters
   in a dataset ([59]23–[60]25). On the other hand, a divisive approach
   such as k-means clustering ([61]19) requires a set number of clusters
   to be established and then assigns individuals to each cluster based on
   their similarity ([62]26–[63]28).

   Here, we show how these two approaches can be applied to clustering
   individuals in the largest cohort of individuals with GDD: the
   Deciphering Developmental Disorders (DDD) cohort ([64]29).

2. Materials and methods

2.1. Cohort description

   The data included in the DDD dataset were acquired from 24 clinical
   genetics centers in the United Kingdom National Health Service and the
   Republic of Ireland. A total of 13,462 individuals with undiagnosed
   developmental disorders were included in this study. After obtaining
   ethics approval at our centers, and with permission from the DDD
   consortium, we analyzed the dataset for phenotypes of the individuals.

   In DDD, the human phenotype ontology (HPO) is used to describe the
   phenotypic information of the individuals ([65]30). HPO contains over
   15,000 terms, which describes phenotypic abnormalities and allows the
   use of standardized and controlled vocabulary for listing phenotypes
   ([66]31). We divided the HPO identified in individuals with GDD between
   structural (dysmorphic features or congenital malformation) and
   functional (affecting behavior or clinical symptoms). We focused this
   study on functional HPO, considering our goal to cluster patients for
   future interventions.

2.2. Genomic sequence analysis

   The exome sequence data of GDD-phenotyped individuals were analyzed in
   two stages. In the first stage, the existing GRCh37/hg19 exome sequence
   was realigned to the GRCh38 genome reference sequence. Then, short
   variant [single nucleotide variants (SNVs) and indels] calling was
   performed using the GATK best practices ([67]32), which involves
   realigning reads to the GRCh38 reference genome, variant calling using
   the HaplotypeCaller and joint genotyping. Finally, variant quality
   recalibration and refinement steps were performed, leading to high
   quality variant callset. In the second stage, these variants were
   annotated for gene information (Ensembl), frequencies (from gnomAD,
   ExAC, and internal cohort GDD), and pathogenicity (from CADD, ClinVar,
   and ClinGen). The annotated set of variants in the callset were
   filtered for gene information, rare variants having minor allele
   frequency (MAF) value of ≤1%, impact on the transcript, and
   pathogenicity. The details of the annotation and filtering criteria can
   be found in Section 1.1 in the [68]Supplementary Material.

2.3. Candidate gene list

   We searched PubMed using the keywords intellectual disability (ID) and
   global developmental delay (GDD), and compiled a list of genes from
   original research and review papers ([69]33–[70]36). We also used genes
   listed in databases related to NDDs for diseases and phenotypes, and
   even included the genome-wide association studies (GWAS) using the same
   keywords [SysID ([71]37, [72]38), DisGenet ([73]39, [74]40), HPO
   ([75]31, [76]41), OMIM ([77]42, [78]43), Orphanet ([79]44), Phenolyzer
   ([80]45, [81]46), Ingenuity Pathway Analysis (Qiagen), Open Targets
   ([82]47, [83]48), AutDB ([84]49)]. We have added the Intellectual
   Disability NGS Radboudumc and Fulgent gene panels to achieve the most
   complete overview. Each gene list was obtained separately, and then
   only those genes that appeared at least three times in the collected
   data were retained, resulting in a final list of 2,537 genes (see
   [85]Supplementary Table 1).

2.4. Clustering strategies

2.4.1. Hierarchical agglomerative clustering of phenotypes

   Among all the HPO-based phenotypes identified in individuals with GDD,
   we considered the functional phenotypes (as opposed to morphological
   features) for this cluster analysis. Since all the phenotypes can be
   treated as binary features and the dissimilarity between two
   individuals can be calculated based on their shared and distinct
   phenotypes, Jaccard distance was used ([86]50) to measure the
   dissimilarity between the individuals, which is calculated as follows:
   [MATH: <mi>D</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>I</mi><mi>i</mi></msub><mo>,</mo
   ><msub><mi>I</mi><mi>j</mi></msub></mrow><mo
   stretchy="false">)</mo><mo>=</mo><mn>1</mn><mo>−</mo><mspace
   width="thickmathspace"></mspace><mstyle displaystyle="true"
   scriptlevel="0"><mrow><mfrac><mrow><mo fence="false"
   stretchy="false">|</mo><mrow><msub><mi>P</mi><mi>i</mi></msub><mo>∩</mo
   ><msub><mi>P</mi><mi>j</mi></msub></mrow><mo fence="false"
   stretchy="false">|</mo></mrow><mrow><mo fence="false"
   stretchy="false">|</mo><mrow><msub><mi>P</mi><mi>i</mi></msub><mo>∪</mo
   ><msub><mi>P</mi><mi>j</mi></msub></mrow><mo fence="false"
   stretchy="false">|</mo></mrow></mfrac></mrow></mstyle> :MATH]

   where
   [MATH: <mi>D</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>I</mi><mi>i</mi></msub><mo>,</mo
   ><msub><mi>I</mi><mi>j</mi></msub></mrow><mo stretchy="false">)</mo>
   :MATH]
   is the distance between individual
   [MATH: <msub><mi>I</mi><mi>i</mi></msub> :MATH]
   and
   [MATH: <msub><mi>I</mi><mi>j</mi></msub> :MATH]
   with set of phenotypes
   [MATH: <msub><mi>P</mi><mi>i</mi></msub> :MATH]
   and
   [MATH: <msub><mi>P</mi><mi>j</mi></msub> :MATH]
   , respectively. Once the distance matrix representing the Jaccard
   dissimilarity between all the individuals is obtained, hierarchical
   clustering with ward linkage method is applied. To assess the cluster
   validity, we selected the silhouette index (SIL) due to its ability to
   provide assessment at the cluster and individual level. SIL is an
   internal measure that considers both the intra-cluster and
   inter-cluster distances to provide the estimate of compactness and
   separateness of the clusters ([87]51). For each data point
   [MATH: <msub><mi>I</mi><mi>i</mi></msub> :MATH]
   , the distance to all data points belonging to the same cluster is
   calculated, and the average of these distances is referred as
   [MATH: <msub><mi>I</mi><mi>j</mi></msub> :MATH]
   . Then, the distance to all data points belonging to other clusters is
   calculated, and the smallest of them is referred as
   [MATH: <msub><mi>b</mi><mi>i</mi></msub> :MATH]
   . The silhouette coefficient for each individual is calculated as
   follows:
   [MATH:
   <mi>S</mi><mi>I</mi><msub><mi>L</mi><mi>i</mi></msub><mo>=</mo><mstyle
   displaystyle="true" scriptlevel="0"><mrow><mfrac><mrow><mo
   stretchy="false">(</mo><mrow><msub><mi>b</mi><mi>i</mi></msub><mo>−</mo
   ><mspace
   width="thickmathspace"></mspace><msub><mi>a</mi><mi>i</mi></msub></mrow
   ><mo
   stretchy="false">)</mo></mrow><mrow><mi>m</mi><mi>a</mi><mi>x</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>a</mi><mi>i</mi></msub><mo>,</mo
   ><msub><mi>b</mi><mi>i</mi></msub></mrow><mo
   stretchy="false">)</mo></mrow></mfrac></mrow></mstyle> :MATH]

   Overall clustering results can be assessed by taking the average
   silhouette index of all the individual points, and its value ranges
   from worst value −1 to best value 1.

2.4.2. Dependency model for genes/phenotype clustering

   We performed a divisive clustering approach with k-means using a
   dependency model to cluster the genes associated to phenotypes. The
   dependency model ([88]Supplementary Figure S1) was created under a
   probabilistic framework where the relationship between a given set of
   phenotypes (H) and genes (G) is inferred for the affected individuals
   (I). This dependency model captures the direct relation among
   individuals having distinct GDD phenotypes (solid red line in
   [89]Supplementary Figure S1). Similarly, the same affected individuals
   carry those rare genetic variants among a set of genes. However, it is
   difficult to establish whether these genes cause the individuals to
   acquire GDD or because of inheriting these GDD-related traits, the
   individuals acquired mutations in these genes. The direction of
   causality cannot be established between phenotype and genes; hence,
   these can only be inferred probabilistically (represented by a broken
   red line in [90]Supplementary Figure S1).

   The underlying dependency model facilitates a framework to identify
   clusters of genes with respect to a given set of phenotypes.
   Mathematically, this can be represented as computing probability
   distribution of genes, which is conditional to the phenotypes given by
   P(G|H):
   [MATH: <mi>P</mi><mo stretchy="false">(</mo><mi>G</mi><mo fence="false"
   stretchy="false">|</mo><mi>H</mi><mo
   stretchy="false">)</mo><mo>=</mo><munderover><mrow><mo
   movablelimits="false">∑</mo></mrow><mrow><mspace
   width="thinmathspace"></mspace><mi>j</mi><mo>=</mo><mn>1</mn></mrow><mi
   >N</mi></munderover><mstyle displaystyle="true"
   scriptlevel="0"><mrow><mfrac><mrow><mi>P</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>I</mi><mi>j</mi></msub></mrow><m
   o stretchy="false">)</mo><mi>P</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>I</mi><mi>j</mi></msub></mrow><m
   o stretchy="false">)</mo><mi>P</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>I</mi><mi>j</mi></msub></mrow><m
   o stretchy="false">)</mo></mrow><mrow><msubsup><mrow><mo
   movablelimits="false">∑</mo></mrow><mrow><mspace
   width="thinmathspace"></mspace><mi>j</mi><mo>=</mo><mn>1</mn></mrow><mi
   >N</mi></msubsup><mo>⁡</mo><mi>P</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>I</mi><mi>j</mi></msub></mrow><m
   o stretchy="false">)</mo><mi>P</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>I</mi><mi>j</mi></msub></mrow><m
   o stretchy="false">)</mo></mrow></mfrac></mrow></mstyle> :MATH]

   Since our task is to compare probability distribution of genes with
   respect to a given set of phenotypes, the denominator in the above
   equation can be ignored, and it can be simplified as follows:
   [MATH: <mi>P</mi><mo stretchy="false">(</mo><mi>G</mi><mo fence="false"
   stretchy="false">|</mo><mi>H</mi><mo stretchy="false">)</mo><mi
   mathvariant="normal">∞</mi><munderover><mrow><mo
   movablelimits="false">∑</mo></mrow><mrow><mspace
   width="thinmathspace"></mspace><mi>j</mi><mo>=</mo><mn>1</mn></mrow><mi
   >N</mi></munderover><mo>⁡</mo><mi>P</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>I</mi><mi>j</mi></msub></mrow><m
   o stretchy="false">)</mo><mi>P</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>I</mi><mi>j</mi></msub></mrow><m
   o stretchy="false">)</mo><mi>P</mi><mo
   stretchy="false">(</mo><mrow><msub><mi>I</mi><mi>j</mi></msub></mrow><m
   o stretchy="false">)</mo> :MATH]

   This eventually yields a matrix of probability distribution where rows
   represent the genes and columns represent the phenotypes. A k-means
   clustering was then applied to a subset of the probability matrix,
   including the phenotypes of interest ([91]52). The elbow and silhouette
   methods were used to determine the optimal number of clusters ([92]53).
   To highlight significant differences between phenotypes for each gene
   group, a t-test was applied to the clustering results. We corrected the
   results obtained using the Bonferroni correction to adjust the p-values
   for multiple comparisons. Adjusted p-values less than or equal to 0.05
   were considered statistically significant.

2.5. Protein–protein interaction and pathway enrichment analysis

   The resulting gene clusters were then analyzed by protein–protein
   interaction prediction, clustering, and pathway enrichment using STRING
   v11.5 ([93]54). The Markov clustering algorithm (MCL) was used to
   identify the gene sub-networks ([94]55). The inflation parameter of MCL
   was set to 1.5 ([95]56). The functional enrichment analysis of each
   module was also performed in the STRING v11.5 database using the GO
   terms and REACTOME pathway. The false discovery rate obtained from the
   functional enrichment analysis describes the degree of significance of
   the enrichment. The p-values are corrected for multiple testing in each
   category using the Benjamini–Hochberg procedure. Visualizations of
   protein–protein interactions and clusters were also obtained using
   STRING v11.5.

3. Results

3.1. Individuals with GDD present with phenotypic diversity

   Among all participants with NDD in the DDD cohort, the individuals with
   GDD represent 49.13% (6,588 probands) of the population. We identified
   761 physiological phenotypes and 2,158 morphological phenotypes
   occurring in this cohort. The phenotypes that are most commonly
   represented (>1%) affected multiple systems (neurological,
   gastrointestinal, locomotor) in individuals with GDD ([96]Figure 1,
   [97]Table 1). The physiological phenotypes that are
   neurodevelopmental-related comprised autistic behavior (6.25%), autism
   (4.57%), absent speech (AS) (4.36%), aggressive behavior (3.25%), and
   stereotypy (4.42%) (including 2.34% of the global term, 2.06% recurrent
   hand flapping, 0.18% repetitive compulsive behavior, and 0.17% tongue
   thrusting). For other physiological issues, seizures were reported in
   21.71% of individuals with GDD ([98]Supplementary Table S2), followed
   by hypotonia (15.12%), strabismus (8.76%), joint hypermobility (6.60%),
   and sleep disturbance (4.36%). The most frequent systemic phenotypes
   involved constipation (5.6%) and gastroesophageal reflux (5.12%).

Figure 1.

   [99]Figure 1
   [100]Open in a new tab

   Most prevalent physiological phenotypes in individuals with GDD using
   HPO ontology (frequency >1%). Multiple systems are affected including
   neurological, respiratory, gastrointestinal, and locomotor systems.
   Created with [101]BioRender.com.

Table 1.

   Most prevalent (>1%) physiological phenotypes in individuals with GDD.
   Physiological phenotype No. of all GDD % of all GDD
   Neurodevelopmental
   Delayed speech and language development 1,037 15.74
   Autistic behavior 412 6.25
   Autism 301 4.57
   Absent speech 287 4.36
   Specific learning disability 173 2.63
   Cognitive impairment 121 1.84
   Delayed gross motor development 113 1.72
   Attention deficit hyperactivity disorder 106 1.61
   Developmental regression 105 1.59
   Motor delay 91 1.38
   Intellectual disability, moderate 74 1.12
   Intellectual disability 69 1.05
   Intellectual disability, severe 68 1.03
   Other neurological
   Seizure 796 12.08
   Sleep disturbance 287 4.36
   Aggressive behavior 214 3.25
   Generalized-onset seizure 178 2.70
   Drooling 167 2.53
   Stereotypy 154 2.34
   Behavioral abnormality 145 2.20
   Recurrent hand flapping 136 2.06
   Generalized non-motor (absence) seizure 121 1.84
   Central hypotonia 113 1.72
   Bilateral tonic-clonic seizure 108 1.64
   Gait ataxia 107 1.62
   Epileptic spasm 93 1.41
   Broad-based gait 87 1.32
   Febrile seizure (within the age range of 3 months to 6 years) 81 1.23
   Generalized myoclonic seizure 75 1.14
   Impaired social interactions 73 1.11
   Abnormal aggressive, impulsive, or violent behavior 69 1.05
   Ataxia 69 1.05
   Short attention span 69 1.05
   Dysphagia 67 1.02
   Eyes
   Strabismus 577 8.76
   Hypermetropia 205 3.11
   Nystagmus 124 1.88
   Myopia 121 1.84
   Bilateral ptosis 94 1.43
   Astigmatism 82 1.24
   Visual impairment 78 1.18
   Cerebral visual impairment 66 1.00
   Ears
   Sensorineural hearing impairment 88 1.34
   Hearing impairment 87 1.32
   Conductive hearing impairment 75 1.14
   Cardiovascular system
   Pulmonic stenosis 66 1.00
   Respiratory system
   Recurrent respiratory infections 72 1.09
   Digestive system
   Constipation 369 5.60
   Gastroesophageal reflux 337 5.12
   Feeding difficulties in infancy 174 2.64
   Feeding difficulties 122 1.85
   Gastrostomy tube feeding in infancy 115 1.75
   Other
   Joint hypermobility 435 6.60
   Generalized hypotonia 371 5.63
   Muscular hypotonia 281 4.27
   Joint laxity 135 2.05
   Eczema 123 1.87
   Neonatal hypotonia 75 1.14
   [102]Open in a new tab

   We present the number of individuals with GDD from the DDD dataset
   presenting with the phenotypes listed. The phenotypes were structured
   as either neurodevelopmental, other neurological, or by target organs
   outside the brain.

   We also delineated the morphological features observed most frequently
   in individuals with GDD ([103]Supplementary Table S3). With regard to
   growth parameters, microcephaly was found in 19.74% of individuals.
   Macrocephaly was much less common in GDD (5.53%). Short stature was
   also frequently reported (5.83%). Cardiac malformations were the most
   commonly encountered malformation with a prevalence rate of 7.14%.
   Scoliosis was present in 2.58% of individuals.

3.2. Genotypic diversity in individuals with GDD

   We identified likely pathogenic and pathogenic variants in the
   individuals with GDD using ClinVar ([104]57). From our list of 2,537
   candidate causal genes, we found that 1,416 genes were affected by
   pathogenic or likely pathogenic variants in individuals with GDD in the
   DDD cohort. We found that most genes contained pathogenic variants in
   only a small number of individuals, suggesting an important genotypic
   diversity ([105]Figure 2, [106]Supplementary Table S1). Indeed, only 86
   genes were found in more than 1% of individuals. We also looked at how
   many genes from our list were affected by pathogenic/likely pathogenic
   variants per individual in the GDD cohort. A total of 3.5% of the GDD
   cohort has no pathogenic/likely pathogenic variants from our gene list,
   and more than 20% of individuals have three mutated genes
   ([107]Figure 2).

Figure 2.

   [108]Figure 2
   [109]Open in a new tab

   Analysis of the overall cohort of individuals with GDD from DDD reveals
   the genotypic complexity of GDD. (A) Individual GDD/ID candidate genes
   are rare, as observed clinically. Using the DDD cohort allows us to
   quantify this with only one to five individuals in a cohort of 6,588
   individuals with GDD. (B) Individuals with GDD present most of the time
   with multiple pathogenic variants in distinct genes. The dotted line
   refers to the median at three genes per individual.

3.3. Comparison of clustering approaches

   Next, we compared two approaches of clustering: (1) hierarchical
   agglomerative where clustering of individuals was based on phenotypes,
   and (2) where individuals with and without a phenotype are compared for
   their genetic mutations. For each approach, we identified the genes
   (from our curated list of GDD/ID genes) harboring likely pathogenic or
   pathogenic variants (ClinVar).
     * (1)
       A hierarchical agglomerative clustering of all individuals with GDD
       resulted in 16 distinct clusters (as visualized in the dendrogram
       in [110]Figure 3A). The silhouette index of all the individuals is
       displayed in [111]Figure 3B. In order to validate the clustering
       results, we assessed each cluster separately. Except for cluster
       16, all other clusters had a positive silhouette index for majority
       of the individuals, indicating that the individuals are clustered
       in the correct group. However, in each cluster, some individuals
       had a negative silhouette index. The variability in the number of
       phenotypes per individual and fewer shared phenotypes among
       individuals could possibly lead to the distortion in the coherence
       of the clusters (detailed analysis in [112]Supplementary Figure
       S2). While we observed more than two phenotypes per cluster, most
       individuals presented with GDD + 1 dominant phenotype. Cluster 16,
       containing 3,077 individuals, showed no dominant phenotype apart
       from the GDD.
       We focused on clusters related to speech since 20% of individuals
       from the GDD cohort presented with delayed speech to various
       degrees. Milder defects were categorized as delayed speech, which
       was present in 15.74% of individuals with GDD, and more severe
       defects were categorized as absent speech, present in 4.36% of
       individuals. HAC Cluster 4 was characterized by delayed speech and
       language development (DSLD), which presented in 329 out of 376
       individuals in this cluster, while Cluster 14 defined individuals
       with absent speech, which presented in134 out of 143 individuals in
       this cluster ([113]Figures 4A, [114]5A). Cluster 4 comprised 376
       individuals with likely pathogenic or pathogenic variants in 483
       candidate genes, while Cluster 14 comprised 143 individuals with
       likely pathogenic or pathogenic variants in 303 candidate genes. We
       then identified a gene network for each cluster ([115]Figures 4B,
       [116]5B). We observed that in both networks, there were
       sub-networks (of more closely related genes with different
       significantly enriched molecular functions). For DSLD, we
       identified 27 sub-networks, with 12 of these sub-networks
       containing more than 10 genes with enrichment in the following
       pathways: sub-network 1: DNA repair, cell cycle, transcription,
       meiosis, and regulation of TP53 activity; sub-network 2: cilium
       assembly, visual photo transmission, cargo trafficking, and
       Hedgehog signaling; and sub-network 3: cation channel complex,
       channel activity, and plasma membrane complex (see
       [117]Supplementary Table S4 for all sub-networks). For AS, we
       identified 23 subclusters, with eight of these containing more than
       10 genes with enrichment in the following pathways: sub-network 1:
       DNA repair, cell cycle, and transcriptional regulation; sub-network
       2: cilium assembly, anchoring to the basal membrane; and
       sub-network 3: interaction between LI and ankyrin, L1CAM (see
       [118]Supplementary Table S5 for all sub-networks). This overlap in
       gene function enrichment between DSLD and AS was also associated
       with 31.7% overlap in genes identified by HAC for each condition
       ([119]Supplementary Figure S3A). Interestingly, those genes showed
       enrichment in more specific pathways associated previously to
       cognition ([120]58), such as signal transduction and secondary
       messenger, transcription, and chromatin modification
       ([121]Supplementary Table S6).
     * (2)
       Next, using the dependency probabilistic models, we determined that
       three clusters would be optimal for k-means clustering to see if we
       could identify subgroups of individuals with speech defects. First,
       we noted that k-means clustering separated individuals with DSLD
       and AS with no gene overlap ([122]Supplementary Figure S3B). As in
       HAC, k-means clusters subdivided at the gene network level
       ([123]Figure 6). DSLD had three sub-networks containing more than
       10 genes, and there were 24 sub-networks in total; the enrichment
       was based on chromatin-modifying enzyme, signal transduction, NOTCH
       signaling (sub-network 1); visual phototransduction, recruitment of
       mitotic centrosomes, noradrenaline inhibition of insulin, calcium
       pathway (sub-network 2); and protein interaction at the synapse,
       neurexin and neuregulin, trafficking of GluR2-containing AMPA
       receptors (sub-network 3) (see [124]Supplementary Table S7 for all
       sub-networks). In AS, only the sub-network 1 had more than 10
       genes: oxygen binding (subcluster 1), ribonucleoside synthetic
       process (subcluster 2), and amino acid binding (subcluster 3) (see
       [125]Supplementary Table S8).

Figure 3.

   [126]Figure 3
   [127]Open in a new tab

   Hierarchical agglomerative clustering using phenotypic profiles of
   individuals with GDD. (A) Dendrogram presenting 16 clusters of
   individuals. (B) Silhouette plot for all the individuals for cluster
   validity analysis.

Figure 4.

   [128]Figure 4
   [129]Open in a new tab

   Gene network from individuals with GDD + DSLD using hierarchical
   agglomerative clustering (Cluster 4). (A) Cluster level analysis of the
   hierarchical clustering results. (a) Top five phenotypes among the
   individuals. (b) Silhouette score for all the individuals and the
   average. (c) Distribution of number of phenotypes per individual. (d)
   Distribution of shared number of phenotypes among all individual pairs.
   For Cluster 4, DSLD is the dominant phenotype. With some individuals
   having negative silhouette index, overall cluster level index is
   positive, indicating that most of the individuals are in the right
   cluster. Similar to Cluster 2, the majority of the individuals have two
   phenotypes, but the number of phenotypes per individual ranges up to
   14, indicating the phenotypic variability among individuals. (B)
   Overall representation. Each color corresponds to a representative
   sub-network of more closely related genes.

Figure 5.

   [130]Figure 5
   [131]Open in a new tab

   Gene network for individuals with AS identified by hierarchical
   agglomerative clustering (Cluster #14). (A) (a) Top five phenotypes
   among the individuals. (b) Silhouette score for all the individuals and
   the average. (c) Distribution of number of phenotypes per individual.
   (d) Distribution of shared number of phenotypes among all individual
   pairs. For Cluster 14, absent speech is the dominating phenotype, and
   the average silhouette index for the cluster is 0.058, which is
   slightly above zero. Almost half of the individuals have a negative SIL
   index, which could possibly be due to the high variability in the
   number of phenotypes per individual. As shown in the third plot for
   Cluster 14, the number of phenotypes per individual ranges up to 15,
   and the majority of the individuals share only two phenotypes, leading
   to less similarity among individuals. (B) Overall representation. Each
   color corresponds to a representative sub-network of more closely
   related genes.

Figure 6.

   [132]Figure 6
   [133]Open in a new tab

   Gene network for individuals with DSLD or AS identified from k-means
   clustering. (A) k-means clustering of individuals without speech
   disorder, delayed speech, or absent speech. (B) Overall gene network
   for individuals with delayed speech. (C) Overall gene network for AS.
   Each color corresponds to a representative sub-network of more closely
   related genes.

   Next, we assessed if genes that were identified for DSLD and AS
   overlapped between HAC and k-means clusters. Surprisingly, only 12.7%
   and 8.4% of genes overlapped for DSLD and AS between the two clustering
   approaches, respectively ([134]Supplementary Figure S4).

   In order to assess if these differences in results between clustering
   approaches were phenotype-specific, we verified if the phenotype
   seizure, which is commonly associated with GDD, would provide similar
   results. In HAC, Cluster #2 is characterized by GDD + seizure and
   pathogenic/likely pathogenic variants in 510 candidate genes
   ([135]Figure 7A). Again, at the gene network level, we observed
   fragmentation into sub-networks ([136]Figure 7B, [137]Supplementary
   Table S9). The MCL clustering found 30 sub-networks, and 13 of them
   contained more than 10 genes. Each sub-network involved distinct
   pathways: neuronal system, L1 and ankyrin interaction, axon guidance,
   synaptic transmission (sub-network 1); transcription regulation, cell
   cycle, and DNA repair (subcluster 2); and sensory function, cilium,
   vision, and RNA polymerase (subcluster 3) (see [138]Supplementary Table
   S9 for all sub-networks).

Figure 7.

   [139]Figure 7
   [140]Open in a new tab

   Genetic makeup and clustering for individuals with GDD and seizure
   (Cluster #2). (A) (a) Top five phenotypes among the individuals. (b)
   Silhouette score for all the individuals and the average. (c)
   Distribution of number of phenotypes per individual. (d) Distribution
   of shared number of phenotypes among all individual pairs. For
   instance, for Cluster 2 in row 2, the first plot shows that seizure is
   the most dominant phenotype; the second plot shows the silhouette index
   of all individuals, which is positive for the majority of the
   individuals, indicating that the individuals are grouped in the right
   cluster. The third plot in row 2 shows that the majority of the
   individuals in Cluster 2 has two phenotypes but also ranges up to 10
   phenotypes for some of the individuals. The fourth plot shows that most
   of the individuals share two phenotypes in this cluster. (B) Overall
   gene network for all individuals included in the agglomerative cluster.

   The optimal number of k-means clustering for individuals with GDD with
   or without seizure was set at two so that only one gene cluster was
   identified as being related to seizures ([141]Figure 8A). The gene
   network can also be divided into subgroups, and MCL clustering has
   highlighted 14 subclusters with only two subclusters having more than
   10 genes ([142]Figure 8B). The molecular functions for each cluster
   were as follows: subcluster 1: organic and hetero-cyclic compound
   binding and transcription regulator activity; subcluster 2: synaptic
   function and ion channel and transporter activity; and subcluster 3:
   extracellular matrix and glycoprotein (see [143]Supplementary Table S10
   for all sub-networks).

Figure 8.

   [144]Figure 8
   [145]Open in a new tab

   Identification of individuals with/without seizure based on
   probabilistic genotype–phenotype clustering. (A) Clustering of genes
   based on the association with seizure (on the left are genes not
   encountered in individuals with GDD and seizure, while on the right are
   genes associated with GDD and seizure. Higher probability is marked in
   red). (B) Genetic network showing clustering of genes found in
   individuals with seizures, revealing the presence of sub-networks.

   We then compared the overlap between the two clustering approaches. We
   found that 62 genes (11%) overlapped between HAC and k-means
   ([146]Supplementary Figure S5). Importantly, this degree of overlap was
   similar to the overlap identified between the two clustering approaches
   when assessing DSLD and AS phenotypes.

4. Discussion

   Global developmental delay is a clinical entity encountered commonly in
   both general and specialized medical practice, affecting the lives of
   approximately 3% of the pediatric population or almost 60 million
   children worldwide ([147]59). While some individuals with syndromes
   such as Down, Fragile X, Rett, and Angelman have been diagnosed via
   targeted gene testing, most individuals remained undiagnosed until the
   development of genome-wide tools such as chromosomal microarray (CMA)
   and whole exome sequencing (WES) ([148]15). Together, these genome-wide
   tools have raised the diagnostic yield in GDD to approximately 60%
   ([149]15), and have identified approximately 2,000 genes replicated in
   at least three studies ([150]Supplementary Table S1). This genetic
   complexity has led to challenging targeted interventions. Novel
   approaches such as basket trials, where one drug is used to treat
   distinct but related conditions, have shown the promises of precision
   medicine ([151]15–[152]17) for those with GDD, but the best approach to
   clustering individuals remains unknown.

   We leveraged the largest dataset of individuals with GDD, which is the
   DDD ([153]29, [154]30), and tested various clustering approaches. By
   including 6,588 individuals with GDD, we were able to identify clear
   clusters based on phenotypic proximity. We considered the possibility
   of combining similar or closely related categories into a single main
   category. However, we decided to maintain the specificity of each HPO
   term in order to avoid bias in the analyses. Individuals were annotated
   by their clinician, and we wanted to take advantage of the opportunity
   to accurately capture the various manifestations of GDD. This accurate
   phenotypic characterization has enhanced the robustness of our
   clustering analysis and has facilitated the identification of relevant
   gene clusters. In our study, we also wanted to use a pragmatic
   approach, reproducing what would be done when planning a
   pharmacological intervention in individuals with GDD, so that
   individuals would be included based on their GDD genetic diagnosis, and
   not based on a phenotype-specific genotype; hence, the focus on their
   GDD gene diagnosis.

   Since there is an interest (and need due to their high number of genes
   and low individual prevalence) in combining individuals with GDD into a
   “basket” targeted by the same drug (one drug-multiple targets), we
   wanted to assess how this could be achieved. We clustered based on
   phenotype, following two of the most common approach: divisive based on
   presence or absence of a phenotype or agglomerative, which does not
   assume a given number of clusters and therefore can lead to a
   combination of phenotypes. When comparing clustering approaches, we
   found that hierarchical agglomerative clustering, an approach where
   individuals sharing features are grouped together, could identify
   bigger clusters of genes, but was less precise in segregating genes
   between related conditions (for instance, delayed speech and language
   development compared with absent speech). On the other hand, k-means
   clustering provided more distinct groups of genes but identified fewer
   genes per phenotype. Importantly, the two methods showed overlap in
   approximately 10% of the genes they identified. There is a limited
   overlap in the clusters identified between the two methods. We
   postulate that this relates to the difference in approach (divisive vs.
   agglomerative). In the divisive approach, the groups are divided based
   on the presence or absence of a specific phenotype. We think that this
   approach leads to a smaller but more stringent set of causative genes.
   The agglomerative approach identifies phenotypic clusters with a
   predominant phenotype (but with the inclusion of other phenotypes as
   well. This may lead to the identification of a different genetic makeup
   (genes with pleiotropic effect for instance). We believe that our work
   will point out that the method used to identify individuals in a future
   treatment trial using a basket trial approach should consider how
   participants are grouped. Also, for both approaches of clustering, our
   results have shown consistently that the genetic makeup of a relatively
   homogenous phenotypic cluster is constituted of multiple subclusters.
   So for a given basket, it may be possible to understand the response
   based on that molecular information. This also shows the importance of
   performing such genetic characterization. It is interesting to observe
   that while the pathways found in each approach are overall similar, the
   number of individuals and their individual genes in each cluster are
   somehow different, which is probably due to the “phenotypic purity” of
   the cluster.

   An important observation in considering future basket trials was that
   individuals with GDD harboring the same phenotypes could be further
   subdivided based on genomic information into gene network clusters.
   This is important due to the reason that individuals within a given
   basket for a trial may need to be guided by genetic information and
   assigned to different treatment regimen.

   These findings also extended to other common comorbidities such as
   seizure. Seizures were found to be present in 12.08% of the individuals
   with GDD. This finding is similar to what has been reported in autism
   spectrum disorder (ASD) ([155]60), but lower than the 56% prevalence
   rate of epilepsy in a GDD cohort that was reported recently ([156]61).
   Importantly, we observed a similar behavior of both clustering
   approaches: HAC identified more genes than the k-means, and that only
   approximately 12% of genes overlapped between methods.

   It is also important to note that we have made two major filters in
   this study: first, filter on a list of candidate genes for cognitive
   neuro difference (GDD and ID), and second to select only the variants
   in these genes annotated as pathogenic/likely pathogenic in ClinVar.
   Therefore, it was expected that the enrichment found for each cluster,
   especially for those found with the HAC method (which shows more genes
   than the k-means method), would be related to the properties of the
   selected set and not necessarily cluster specific. In contrast, the
   number of genes enriched in particular molecular functions changes
   between clusters; for example, in Cluster 4 (DSLD), DNA binding and
   transcription stand out in the first subclusters of the 1,417 ID + GDD
   gene set, but the channel activity stands out far in the subclusters of
   the 1,417 gene set. This may suggest that ion channels are more related
   to DSLD.

   Overall, our study provides a rationale for the possibility of having
   success with basket trials in the future for drug development in GDD by
   showing how large groups of individuals with GDD could be separated
   into closely related subgroups. It also shows how different clustering
   approaches will influence the size and nature of the cluster.
   Furthermore, despite showing shared genetic function, each sub-network
   (as opposed to the phenotypic cluster) may need to be considered in
   terms of druggability and potential side effects (for DNA or RNA
   binding). This highlights the potential importance of genomic
   sequencing in pharmaceutical trials. Our findings point to the fact
   that it is important to correlate phenotype with not only a single gene
   but also take into account the polygenic nature of each individual. It
   will be important to understand that phenotypes may not be explainable
   by considering a single gene correlation but rather a polygenic
   approach and that future work (probably with large sample size) will be
   required to assess the correlation between combination of pathogenic
   variants and phenotypic presentation. Also, analysis of each mutation
   present against the proposed treatment would be important in future
   clinical trials.

   It should be noted that the progress in clustering includes the
   application of deep learning-based methods, which could potentially
   complement our research into the genetic basis of GDD. Furthermore, our
   future research will be based on the assessment of persons with ASD,
   given the features that they share with GDD. By extending our
   clustering methods to ASD, we could not only highlight common genetic
   factors but also refine targeted interventions, broadening the impact
   of our study beyond GDD. Aware of the role of intronic variants, it
   will be necessary to integrate whole genome sequencing (WGS) to take
   into account the whole genetic background of GDD.

   Future in vivo study will be needed to validate which method is most
   useful at finding successfully treated clusters. Indeed, it will be
   important to use animal models this time and patient-derived cell lines
   to validate the response to candidate treatment for genes belonging to
   a given cluster. Together, this clustering has the potential to
   accelerate access to targeted treatment for individuals with GDD.

Acknowledgments