Abstract

   Effective research and clinical application in audiology and hearing
   loss (HL) require the integration of diverse data, yet the absence of a
   dedicated database impedes understanding and insight extraction in HL.
   To address this, the Genetic Deafness Commons (GDC) is developed by
   consolidating extensive genetic and genomic data from 51 public
   databases and the Chinese Deafness Genetics Consortium. This repository
   comprises 5 983 613 variants across 201 HL genes, revealing the genetic
   landscape of HL and identifying six novel mutational hotspots within
   the DNA‐binding domains of transcription factors. Comparative
   phenotypic analyses highlighted considerable disparities between human
   and mouse models. Among the 201 human HL genes, 133 exhibit hearing
   abnormalities in mice; 35 have been tested in mice without exhibiting a
   hearing loss phenotype; and 33 lack auditory testing data. Moreover,
   gene expression analyses in the cochleae of mice, humans, and rhesus
   macaques demonstrated a notable correlation (R^2 = 0.718–0.752).
   Utilizing gene expression, function, pathway, and phenotype data, a
   SMOTE‐Random Forest model identified 18 candidate HL genes, including
   TBX2, newly confirmed as an HL gene. As a comprehensive and unified
   repository, the GDC advances audiology research and practice by
   improving data accessibility and usability, ultimately fostering deeper
   insights into hearing disorders.

   Keywords: database, gene expression, hearing loss, mouse phenotype,
   machine learning
     __________________________________________________________________

   Overview of the Genetic Deafness Commons (GDC), integrating data from
   the Chinese Deafness Genetics Consortium (CDGC) and 51 public
   databases. The GDC provides tools for variant search, functional
   predictions, and gene‐disease visualization, offering insights into 201
   hearing loss genes and facilitating novel gene discovery and clinical
   applications.


1. Introduction

   Hearing loss (HL) is the most common sensory deficit and one of the
   most common congenital abnormalities. Estimates indicate that among
   every 1000 newborns screened, 1.1–3.5 will have HL.^[ [46]^1 , [47]^2
   ^] The etiology of HL is multifactorial, encompassing genetic defects,
   physical trauma, infections, drug toxicity, noise exposure, and aging,
   among other factors.^[ [48]^3 , [49]^4 , [50]^5 ^] Genetic
   predispositions play a pivotal role in congenital and early‐onset HL,
   characterized by significant genetic and phenotypic heterogeneity, with
   more than 200 genes identified thus far.^[ [51]^6 , [52]^7 , [53]^8 ^]

   Current research and clinical applications in audiology and hearing
   disorders, such as novel HL gene identification, exploration of
   auditory mechanisms, research on auditory system development, variant
   interpretation, genetic diagnosis, and the development of gene
   therapies, hold the promise of translating individual genomic data into
   clinically relevant information to aid in disease diagnostics and
   facilitate personalized therapeutic decision making.^[ [54]^9 , [55]^10
   , [56]^11 , [57]^12 , [58]^13 , [59]^14 , [60]^15 ^] These efforts
   depend critically on the integration of vast data and knowledge
   accumulated across numerous databases and repositories. This
   integration includes gene/region‐level annotations detailing genomic
   features (e.g., UCSC genome browsers^[ [61]^16 ^] and ENSEMBL^[ [62]^17
   ^]), transcriptional information (e.g., gEAR^[ [63]^18 ^] and
   Genotype‐Tissue Expression (GTEx)^[ [64]^19 ^]), gene/protein functions
   (e.g., UniProtKB^[ [65]^20 ^] and Gene Ontology (GO)^[ [66]^21 ^]) and
   structures (e.g., InterPro^[ [67]^22 ^] and PDB^[ [68]^23 ^]), pathway
   information (e.g., Reactome^[ [69]^24 ^] and Kyoto Encyclopedia of
   Genes and Genomes (KEGG)^[ [70]^25 ^]), gene‐gene interactions (e.g.,
   STRING^[ [71]^26 ^] and BioGrid^[ [72]^27 ^]), and gene‐drug
   interactions (e.g., PharmGKB^[ [73]^28 ^] and DGIdb^[ [74]^29 ^]).
   Variant annotations also play a pivotal role in interpreting the
   effects of molecular processes and disease causation, including
   population frequencies (e.g., gnomAD,^[ [75]^30 ^] ChinaMAP,^[ [76]^31
   ^] and BBJ^[ [77]^32 ^]), genotype‐phenotype correlations and
   pathogenicity classifications (e.g., ClinVar,^[ [78]^33 ^] HGMD,^[
   [79]^34 ^] and Deafness Variation Database (DVD)^[ [80]^6 ^]), in
   silico function predictions (e.g., CADD,^[ [81]^35 ^] REVEL,^[ [82]^36
   ^] and SpliceAI^[ [83]^37 ^]), and automated literature mining (e.g.,
   LitVar^[ [84]^38 ^] and Variant2Literature^[ [85]^39 ^]). Additionally,
   disease and phenotype information extracted from literature through
   expert curation includes standardized disease/phenotype descriptions
   (e.g., Human Phenotype Ontology (HPO)^[ [86]^40 ^] and MedlinePlus^[
   [87]^41 ^]), disease‐gene correlation (e.g., OMIM^[ [88]^42 ^] and
   ClinGen^[ [89]^43 ^]), and animal models (e.g., Mouse Genome
   Informatics (MGI)^[ [90]^44 ^]). Together, these diverse databases
   provide essential support for researchers to explore the complexities
   of genetic and molecular mechanisms underlying audiology and hearing
   disorders, offering crucial insights into broader biological processes,
   disease pathogenesis, and potential therapeutic interventions.

   However, the lack of a dedicated, comprehensive database for hearing
   research has significantly hindered the compilation and integration of
   these resources. Existing hearing‐related databases like DVD and gEAR
   cater to niche aspects of hearing loss research, with DVD focusing on
   variant classification within HL genes and gEAR on gene expression in
   the cochlea. Information is often scattered across specialized catalogs
   tailored to specific fields, particular model organisms, or specific
   techniques, resulting in heterogeneous data vocabularies, ontologies,
   and formats. It complicates efforts to fully reconcile and integrate
   the data, posing significant challenges in understanding the landscape
   of hearing loss, identifying and prioritizing relevant information, and
   consequently impeding the extraction of meaningful insights.

   To bridge the gap, we established the Genetic Deafness Commons (GDC,
   [91]http://gdcdb.net/), a standardized database and knowledge base that
   comprehensively consolidates and integrates genetic and genomic data
   from both public and in‐house sources. The GDC leverages 51 public
   databases to provide extensive information on HL genes, variants, and
   phenotypes. Additionally, it integrates genetic findings from 22 125 HL
   cases from the Chinese Deafness Genetics Consortium (CDGC), offering
   real‐world patient cohort data to support variant interpretation and
   curation. Utilizing the extensive dataset of GDC, this study conducts a
   thorough analysis of the genetic architecture of HL genes and variants,
   uncovering multiple new patterns to improve the efficacy of variant
   pathogenicity interpretation and genetic diagnosis. By integrating
   public gene expression data from mice and in‐house data from rhesus
   macaques, this study applied a machine learning method to identify a
   set of new candidate HL genes, demonstrating the significant value of
   GDC in deciphering the genetic and molecular foundations of audiology
   and hearing‐related conditions.

2. Results

2.1. Overview of GDC

   The current version of the GDC amalgamates data from 51 public sources
   and the CDGC project, incorporating 5 983 613 variants across 201
   reported HL genes (Figure [92] 1 and Table [93]S1, Supporting
   Information). GDC encompasses detailed disease information, gene
   expression patterns, annotations related to gene functions and
   pathways, protein structures and domains, gene‐gene and gene‐drug
   interactions, as well as other essential data. Additionally, GDC
   introduces multiple specialized annotations to enhance understanding of
   each variant, including allele frequencies from 14 populations of five
   large genome sequencing studies, variant pathogenicity classifications
   from four sources, 20 prediction scores for functional damage,
   genotype‐phenotype correlations, and variant‐based literature mining
   results. Comprehensive data sources are detailed in Table [94]S2
   (Supporting Information) to support further exploration. Furthermore,
   GDC offers a user‐friendly query interface to ensure that information
   is readily accessible in various formats such as graphical
   representations, tables, and downloadable files, facilitating easy
   search and navigation for researchers investigating any HL‐related
   variants and genes.

Figure 1.


   Construction of the Genetic Deafness Commons (GDC). The data from 51
   public databases and the Chinese Deafness Genetics Consortium were
   collected. A series of data processing steps, including data alignment,
   gene and variant annotation, format transformation, and liftover between
   human reference genomes GRCh37 and GRCh38, was applied to integrate
   and harmonize data in GDC. All data on GDC can be accessed online or
   downloaded. All the above logos are from official databases.
   Illustrations of humans, mice, and rhesus monkeys were created with
   [96]Biorender.com.

   Of the 201 HL genes in the GDC, 139 were associated with non‐syndromic
   HL (NSHL) and 93 with syndromic HL (SHL), linking to 92 different
   disorders including Usher syndrome, Alport syndrome, and Waardenburg
   syndrome (Figure [97] 2A). Notably, 31 genes were implicated in both
   NSHL and SHL. In terms of inheritance patterns, 95 genes were linked to
   autosomal dominant (AD), 128 to autosomal recessive (AR), eight to
   X‐linked (XL), and one to mitochondrial (MT) inheritance patterns.
   Furthermore, 31 genes exhibit both AD and AR inheritance patterns.
   These genes are annotated with 7656 functional terms (with a median of
   27 terms per gene) and 8695 phenotypic terms (with a median of 19 terms
   per gene).

Figure 2.


   Summary of genetic and genomic information in GDC. A) Summarization of
   201 HL genes. The outer circle is the genomic location of 201 genes.
   Genes located on different chromosomes are distinguished by color.
   The inner circles show, in order, inheritance patterns, syndromic
   status, variant statistics, GO terms, HPO terms, and gene‐gene
   interactions. B) The total number and proportion of variants observed,
   classified by genomic location. Variants in coding regions were further
   classified by functional consequence. C) Developmental time points of
   gene expression data from cochleae of mice and rhesus macaques that
   were integrated by GDC.

   Within the exons, introns, and 1 kb upstream and downstream regions of
   the 201 HL genes, 5 983 613 genetic variations were collated in the GDC
   from multiple sources including CDGC (v202407, n = 217 915), gnomAD
   (v3.1, n = 4 973 216), ChinaMAP (v1, n = 997 781), DVD (v9, n =
   2 100 721), ClinVar (20230702, n = 121 439), and HGMD (2023v2, n =
   24 470). Of all variants, 4.63% were located in exonic and adjacent (±8
   bp) intronic regions. Missense variants constitute the majority of such
   variants at 61.74%. The next most prevalent types are synonymous
   variants (27.5%) followed by indels (5.63% frameshift and 1.98%
   in‐frame), stop gain (2.98%), 5′ UTR (0.38%), canonical splice sites
   (±2 bp of an intron, 0.12%), splice regions (±3‐8 bp of an intron,
   0.39%), 3′ UTR (1.92%), up/downstream regions (1.68%), and start/stop
   loss (<0.2%) (Figure [99]2B). Most variants were extremely rare, with
   the GDC dataset containing 1 436 357 (24%) variants not reported in
   gnomAD. Additionally, 2 338 525 (39.06%) variants in the GDC dataset
   are singletons in gnomAD, with doubletons, tripletons, and quadruplets
   accounting for 10.72%, 5.15%, and 3.08%, respectively, and 639 113
   variants (10.67%) have a minor allele frequency (MAF) < 0.0002 and
   allele count (AC) > 4.
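
   The rarity bins described above can be illustrated with a minimal
   Python sketch. Only the cutoffs (AC of 1 to 4, and MAF < 0.0002 with
   AC > 4) come from the text; the function name frequency_bin, its
   labels, and the use of None for a variant absent from gnomAD are
   illustrative assumptions and not part of the GDC pipeline.

```python
from collections import Counter
from typing import Optional


def frequency_bin(ac: Optional[int], maf: Optional[float]) -> str:
    """Assign a variant to a rarity bin from its gnomAD allele count and MAF.

    ac is None (or 0) when the variant is not reported in gnomAD.
    Labels are hypothetical; cutoffs follow the text.
    """
    if ac is None or ac == 0:
        return "absent"
    small_counts = {1: "singleton", 2: "doubleton", 3: "tripleton", 4: "quadruplet"}
    if ac in small_counts:
        return small_counts[ac]
    # Remaining variants have AC > 4; split them on the MAF cutoff.
    if maf is not None and maf < 0.0002:
        return "rare, AC > 4"
    return "other"


# Toy (AC, MAF) pairs, invented purely for illustration.
toy_variants = [(None, None), (1, 6.6e-6), (2, 1.3e-5), (7, 4.6e-5), (900, 0.006)]
print(Counter(frequency_bin(ac, maf) for ac, maf in toy_variants))
```

   Binning on allele count first and frequency second mirrors the order
   in which the categories are reported, so each variant falls into
   exactly one bin.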

   Furthermore, GDC incorporated extensive gene expression data from both
   human and model animals. This included bulk RNA‐seq data of 54 human
   tissues sourced from the GTEx database, single‐cell RNA‐seq data at 12
   developmental stages of the mouse cochlea from five public repositories
   ([100]GSE172110, [101]GSE182202, [102]GSE181454, [103]GSE202920, and
   CRA004814)^[ [104]^45 , [105]^46 , [106]^47 , [107]^48 , [108]^49 ^]
   and in‐house bulk RNA‐seq data from three adult rhesus macaque cochlea
   (Figure [109]2C). These datasets enable the GDC to provide a detailed
   and dynamic view of gene expression across different species and
   developmental stages, significantly enhancing the potential for
   discoveries in audiology and hearing disorders.

2.2. Variant Pathogenicity Classification and Curation

   Across the CDGC, DVD, ClinVar, and HGMD databases, 38.18% (n =
   2 284 692) of all GDC variants were classified as pathogenic (P),
   likely pathogenic (LP), benign (B), likely benign (LB), or variants
   with uncertain significance (VUS). A total of 35 702 variants were
   reported as P/LP by at least one source (Figure [110] 3A). Among these
   variants, 16 244 (45.5%) were consistently classified as P/LP across
   two or more sources, whereas 11 607 (32.51%) P/LP variants were
   reported in only one source. Conversely, 7851 (21.99%) variants showed
   medically significant classification conflicts, shifting between P/LP
   and B/LB, or VUS, indicating the need for further validation using more
   extensive patient data (Figure [111]3B). Among the databases, HGMD
   exhibited the most conflicts between P/LP and B/LB classifications
   (n = 1040), as shown in Figure [112]3C and Figure [113]S1
   (Supporting Information). Remarkably, 640 variants previously
   classified as “P/LP” (“DM” or “DM?”) in HGMD were reclassified to B/LB
   by CDGC: 467 because of an MAF > 0.5% under AR inheritance or > 0.1%
   under AD inheritance, 134 because of a relatively high MAF combined
   with benign computational predictions, 35 because they were
   AD‐associated variants detected in multiple CDGC controls, and 4 based
   on other combined evidence. This highlights the importance and
   efficacy of incorporating data from diverse and underrepresented
   populations in variant interpretation.
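
   The two rules used in this section, cross‐source concordance of P/LP
   calls and the frequency cutoff for reclassification, can be sketched
   as follows. This is a simplified illustration under stated
   assumptions: the function names are hypothetical, a conflict is
   assumed to mean that at least one source reports P/LP while another
   reports B/LB or VUS, and the population in which MAF is measured is
   not specified here. Only the 0.5% (AR) and 0.1% (AD) cutoffs are
   taken from the text.

```python
from typing import Iterable

PLP = {"P", "LP"}


def plp_concordance(calls: Iterable[str]) -> str:
    """Categorise a variant from its per-source calls (P, LP, B, LB or VUS)."""
    calls = list(calls)
    n_plp = sum(call in PLP for call in calls)
    if n_plp == 0:
        return "not P/LP"
    if n_plp < len(calls):
        # Assumed definition: some other source disagrees (B/LB or VUS).
        return "conflict"
    return "consistent" if n_plp >= 2 else "single source"


def exceeds_maf_cutoff(maf: float, inheritance: str) -> bool:
    """Return True when MAF is above the cutoff quoted for AR or AD."""
    cutoffs = {"AR": 0.005, "AD": 0.001}  # 0.5% and 0.1%, as stated in the text
    return maf > cutoffs[inheritance]


print(plp_concordance(["P", "LP", "P"]))
print(plp_concordance(["LP", "VUS"]))
print(exceeds_maf_cutoff(0.002, "AD"))
```

   A real curation workflow would weigh this frequency evidence together
   with the other lines of evidence mentioned above, such as
   computational predictions and observations in controls.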

Figure 3.


   Variant pathogenicity classification from varied sources. A) Shared and
   unique P/LP variants collected from DVD, ClinVar, HGMD, and CDGC
   datasets. B) Comparative overview of variant classification in GDC and
   the consistency among the databases. C) P/LP and B/LB conflicts across
   DVD, ClinVar, HGMD, and CDGC. The bar plot on the left shows the total
   number of conflicting variant classifications for each dataset.

2.3. Analyzing Pathogenic Variants for Insights into Gene Function and
Tolerance

   Protein truncating variants (PTVs), including stop‐gain, frameshift,
   start loss, stop loss, and canonical splice site changes, are expected
   to result in a complete loss of function (LoF) of the affected
   transcripts and are considered potentially deleterious.^[ [115]^50 ^]
   Within the protein‐coding genes included in the GDC, the count of PTVs
   ranged from 6 to 1886. A correlation between PTV count and coding
   region size was observed (R^2 = 0.72, Figure [116]S2 and Table [117]S3,
   Supporting Information). However, genes such as AIFM1, MAP1B, and USP48
   exhibited a significantly lower PTV ratio, suggesting an intolerance to
   PTVs in these genes. A total of 31 558 PTVs were classified by CDGC,
   DVD, ClinVar, or HGMD. Notably, the majority of the PTVs classified as
   VUS (n = 6785) were contributed by DVD (77.47%). Analysis of the
   variant types among P/LP classifications for each gene revealed that
   PTVs constitute more than half of P/LP variants in 127 genes (67.91%)
   (Figure [118] 4A). All P/LP variants in the genes EPS8 (n = 13), SYNE4
   (n = 12), MPZL2 (n = 11), GRXCR2 (n = 5), BDP1 (n = 3), GAS2 (n = 2),
   CLDN9 (n = 1), and CD164 (n = 1) were PTVs, indicating that LoF is
   likely the disease mechanism in these genes. In 52 genes (27.96%), more
   than half of the P/LP variants were missense (Figure [119]4A).
   Specifically, in genes like DIABLO, ELMOD3, IFNLR1, MYO1C, P2RX2,
   PDE1C, PLS1, POLR1B, S1PR2, SCD5, THOC1, and WBP2, all P/LP variants
   were missense. In the literature, a gain of function (GoF) mechanism
   has been reported only for DIABLO,^[ [120]^51 ^] so whether GoF is the
   disease mechanism in the remaining genes warrants further
   investigation. The rate of P/LP variants among PTVs was then calculated
   for each gene to visualize the distribution of genes across inheritance
   patterns (Figure [121]4B). Genes with an established LoF
   mechanism, such as POU4F3, CHD7, MITF, EYA1, and EYA4, were located in
   a quadrant where both the P/LP ratio of PTVs and the PTV ratio of P/LP
   variants exceed 50% (Figure [122]S3, Supporting Information).
   Conversely, genes tolerant to LoF, such as DIAPH3 and COCH, are found
   in a quadrant where both ratios are below 50%. It is noteworthy that
   genes with potentially lethal LoF variants, like ACTG1 and AIFM1, also
   fall into this category.
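
   The quadrant assignment described above depends on two per‐gene
   ratios, which the following Python sketch makes explicit. The class
   and function names are hypothetical, the example counts are invented,
   and taking classified PTVs as the denominator of the first ratio is
   an assumption, since the exact denominator is not stated here. Genes
   lying exactly on the 50% line are labeled "mixed" in this sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GeneCounts:
    plp_ptv: int         # PTVs classified as P/LP
    classified_ptv: int  # all PTVs with a classification (assumed denominator)
    plp_total: int       # P/LP variants of any type


def lof_quadrant(g: GeneCounts, cutoff: float = 0.5) -> Tuple[float, float, str]:
    """Return (P/LP ratio of PTVs, PTV ratio of P/LP variants, quadrant label)."""
    if g.classified_ptv <= 0 or g.plp_total <= 0:
        raise ValueError("both denominators must be positive")
    plp_ratio_of_ptvs = g.plp_ptv / g.classified_ptv
    ptv_ratio_of_plp = g.plp_ptv / g.plp_total
    if plp_ratio_of_ptvs > cutoff and ptv_ratio_of_plp > cutoff:
        label = "high/high"  # pattern reported for genes with an established LoF mechanism
    elif plp_ratio_of_ptvs < cutoff and ptv_ratio_of_plp < cutoff:
        label = "low/low"    # pattern reported for LoF-tolerant genes
    else:
        label = "mixed"
    return plp_ratio_of_ptvs, ptv_ratio_of_plp, label


# Invented counts for a hypothetical gene.
print(lof_quadrant(GeneCounts(plp_ptv=40, classified_ptv=50, plp_total=60)))
```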

Figure 4.


   Genetic landscape of HL‐associated genes. A) Variant types of P/LP
   variants across 201 HL genes. The barplot above indicates the number of
   variants. Abbreviations: AD, Autosomal dominant; AR, Autosomal
   recessive; XL, X‐linked; PTV, Protein Truncating Variant. B)
   Pathogenicity classification of PTVs across 201 HL genes. C)
   Distribution of rare types of pathogenic variants, including deep
   intronic, synonymous, splice region, start loss, stop loss, UTRs, and
   intergenic. D) Distribution of PTVs in genes enriched with NMD‐escape
   PTVs. The orange circle indicates that the variants are within the last
   exon or the final 50 base pairs of the penultimate exon.

   In addition, rare types of pathogenic variants were observed in
   multiple genes (Figure [124]4C). For instance, pathogenic splicing
   variants are prevalent in genes such as COL1A1, FGFR2, COL4A5, and
   USH2A. Loss‐of‐start and loss‐of‐stop variants are typically found in
   genes FGFR3, COL1A1, and NDP. Furthermore, pathogenic variants
   affecting upstream regions, untranslated regions (UTRs), or downstream
   non‐coding regions are observed in genes COL4A6, TOP2B, and HARS1.

   In certain genes, namely MARVELD2, MYH9, MYH14, CABP2, CLIC5, HOMER2,
   SPATA5L1, and WFS1, pathogenic PTVs were abundantly identified within
   the last exon or the final 50 base pairs of the penultimate exon
   (Figure [125]4D; Figure [126]S4, Supporting Information), regions
   predicted to potentially escape nonsense‐mediated mRNA decay (NMD). The
   pathogenicity of 3′‐end PTVs in these genes requires careful
   evaluation, given their potentially significant impact on protein
   activity and stability.
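
   The positional rule quoted above (last exon, or final 50 base pairs
   of the penultimate exon) can be written as a short Python check. This
   is a minimal sketch, assuming that exon lengths are given in
   transcript order and that the premature termination codon is located
   by a 0‐based transcript coordinate; the function name and this
   coordinate convention are illustrative assumptions, and other
   determinants of NMD efficiency are ignored.

```python
from typing import Sequence


def predicted_nmd_escape(exon_lengths: Sequence[int], ptc_pos: int, window: int = 50) -> bool:
    """Apply the last-exon / final-50-bp-of-penultimate-exon rule.

    exon_lengths: exon sizes in transcript order (5' to 3').
    ptc_pos: 0-based position of the premature stop in the spliced transcript.
    """
    total = sum(exon_lengths)
    if not 0 <= ptc_pos < total:
        raise ValueError("ptc_pos lies outside the transcript")
    last_exon_start = total - exon_lengths[-1]
    if ptc_pos >= last_exon_start:
        return True  # premature stop lies in the last exon
    # A stop upstream of the last exon implies at least two exons.
    # Limit the window to the penultimate exon itself if it is shorter than 50 bp.
    boundary = last_exon_start - min(window, exon_lengths[-2])
    return ptc_pos >= boundary


exons = [100, 200, 150]  # invented exon lengths
for pos in (50, 249, 250, 320):
    print(pos, predicted_nmd_escape(exons, pos))
```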

2.4. Identification of “Hotspots” for Pathogenic Missense Variants

   We further evaluated the distribution of missense P/LP variants using
   Gaussian kernel density estimation and overlaid the resulting densities
   on protein domains to identify hotspots of disease‐related missense
   mutations. By integrating
   various sources, we identified 8787 missense P/LP variants in 201 HL
   genes (ranging from 1 to 785 per gene). Significant enrichment of P/LP
   missense variants was predominantly found in transcription factor
   genes, including GATA3 ([127]NP_002042: p.264‐342), GRHL2
   ([128]NP_079191: p.213‐438), LMX1A ([129]NP_796372: p.196‐252), MITF
   ([130]NP_000239: p.198‐288), PAX3 ([131]NP_852124: p.34‐159;
   p.220‐276), POU3F4 ([132]NP_000298: p.186‐340), POU4F3 ([133]NP_002691:
   p.179‐333), REST ([134]NP_001350382: p.159‐412), SIX1 ([135]NP_005973:
   p.127‐181), and SOX10 ([136]NP_008872: p.103‐172) (Table [137]S4,
   Supporting Information). Remarkably, these enriched regions were all
   DNA‐binding domains (Figure [138] 5A; Figure [139]S5, Supporting
   Information). The exceptions in transcription factor genes were FOXI1,
   SIX5, and SNAI2, likely due to the limited number of identified P/LP
   variants. We calculated the positive likelihood ratio for a missense
   variant being P/LP in these hotspot regions. The lower boundary of the
   95% confidence interval of the positive likelihood ratio (LR+_LB)
   ranged from 0.26 to 9.75 (Figure [140]5B). Compared to the thresholds
   for pathogenic evidence as suggested by Tavtigian et al.,^[ [141]^52 ^]
   hotspots in PAX3, SOX10, and GATA3 met the moderate level (LR+_LB
   > 4.3) of the American College of Medical Genetics and Genomics (ACMG)
   and the Association for Molecular Pathology (AMP) pathogenic evidence.
   LMX1A, POU3F4, and MITF fulfilled the supporting level (LR+_LB > 2.08)
   for pathogenicity according to the same criteria. Additionally,
   missense variants of KCNQ1 and KCNQ4 were significantly enriched in the
   ion transport domain (LR+_LB = 2.08) and P‐loop domain (LR+_LB = 9.78),
   respectively, consistent with previous reports.^[ [142]^53 ^] No
   significant missense variant enrichment was observed in other genes.
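
   To make the LR+_LB statistic concrete, the sketch below computes a
   positive likelihood ratio and the lower bound of its 95% confidence
   interval from a 2 × 2 table, then compares it with the thresholds
   quoted above (2.08 for supporting, 4.3 for moderate). It uses the
   standard log‐scale confidence interval for a ratio of proportions.
   Whether this is the exact estimator used in the study, and which
   variants form the reference set (for example benign or population
   variants), are assumptions; the function names and counts are
   invented for illustration.

```python
import math
from typing import Tuple


def positive_lr(plp_in: int, plp_out: int, ref_in: int, ref_out: int,
                z: float = 1.96) -> Tuple[float, float]:
    """Return (LR+, lower 95% CI bound) for 'missense variant lies in the domain'.

    plp_in / plp_out: P/LP missense variants inside / outside the domain.
    ref_in / ref_out: reference (assumed non-pathogenic) variants inside / outside.
    """
    if min(plp_in, ref_in) <= 0 or min(plp_out, ref_out) < 0:
        raise ValueError("in-domain counts must be positive, others non-negative")
    sensitivity = plp_in / (plp_in + plp_out)
    false_positive_rate = ref_in / (ref_in + ref_out)
    lr = sensitivity / false_positive_rate
    # Standard error of ln(LR+) for a ratio of two independent proportions.
    se = math.sqrt(1 / plp_in - 1 / (plp_in + plp_out)
                   + 1 / ref_in - 1 / (ref_in + ref_out))
    return lr, math.exp(math.log(lr) - z * se)


def evidence_level(lr_lb: float) -> str:
    """Map LR+_LB to the evidence levels quoted in the text."""
    if lr_lb > 4.3:
        return "moderate"
    if lr_lb > 2.08:
        return "supporting"
    return "below supporting"


lr, lr_lb = positive_lr(plp_in=40, plp_out=10, ref_in=20, ref_out=180)
print(round(lr, 2), round(lr_lb, 2), evidence_level(lr_lb))
```

   Using the lower confidence bound rather than the point estimate keeps
   domains with few observed variants from reaching an evidence level
   that the data cannot support.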

Figure 5.


   Identification of gene domains for enrichment of pathogenic missense
   variants. A) Distribution of P/LP missense variants across genes. Red
   curves are the density for P/LP variants reported in ClinVar, HGMD,
   DVD, and CDGC cohort. Blue curves are for variants identified in the
   gnomAD population. The horizontal bar represents protein domains, with
   the orange color indicating DNA‐binding domains. Abbreviations: ZnF,
   Zinc finger; CP2, Grh/CP2 DNA‐binding (DB) domain profile; TAD,
   Transactivation Domain; bHLH‐LZ, Basic helix‐loop‐helix, Leucine‐zipper;
   PD, “Paired box” domain; HD, Homeodomain; POU‐S, POU‐specific domain;
   POU‐H, POU‐homeodomain; SIX1_SD, Transcriptional regulator, SIX1,
   N‐terminal SD domain; DIM, Dimerization; HMG, High mobility group box;
   K2, Context‐dependent transactivation (SOX E group conserved) domain;
   TA, Transactivation. B) Statistical analysis of likelihood ratio (LR)
   and odds ratio (OR) for a missense variant being P/LP in these domains.

2.5. Discrepancies of HL Phenotype Between Human and Mouse Models

   Phenotypic analysis utilizing model organisms, predominantly mice, is
   crucial for identifying disease genes and elucidating underlying
   mechanisms. Nevertheless, phenotypes associated with homologous genes
   often exhibit substantial differences between humans and mice.^[
   [144]^54 , [145]^55 , [146]^56 ^] By reviewing the mouse phenotype
   database and literature, we categorized mouse phenotypes into three
   groups: “HL”, “No HL Evidence,” and “No Auditory Testing”. The
   classification of mouse phenotypes and the related annotations, along
   with references, can be found in Table [147]S5 (Supporting