Abstract

Effective research and clinical application in audiology and hearing loss (HL) require the integration of diverse data, yet the absence of a dedicated database impedes understanding and insight extraction in HL. To address this, the Genetic Deafness Commons (GDC) is developed by consolidating extensive genetic and genomic data from 51 public databases and the Chinese Deafness Genetics Consortium. This repository comprises 5 983 613 variants across 201 HL genes, revealing the genetic landscape of HL and identifying six novel mutational hotspots within the DNA‐binding domains of transcription factors. Comparative phenotypic analyses highlighted considerable disparities between human and mouse models. Among the 201 human HL genes, 133 exhibit hearing abnormalities in mice; 35 have been tested in mice without exhibiting a hearing loss phenotype; and 33 lack auditory testing data. Moreover, gene expression analyses in the cochleae of mice, humans, and rhesus macaques demonstrated a notable correlation (R^2 = 0.718–0.752). Utilizing gene expression, function, pathway, and phenotype data, a SMOTE‐Random Forest model identified 18 candidate HL genes, including TBX2, newly confirmed as an HL gene. As a comprehensive and unified repository, the GDC advances audiology research and practice by improving data accessibility and usability, ultimately fostering deeper insights into hearing disorders.

Keywords: database, gene expression, hearing loss, mouse phenotype, machine learning

Overview of the Genetic Deafness Commons (GDC), integrating data from the Chinese Deafness Genetics Consortium (CDGC) and 51 public databases. The GDC provides tools for variant search, functional predictions, and gene‐disease visualization, offering insights into 201 hearing loss genes and facilitating novel gene discovery and clinical applications.

1. Introduction

Hearing loss (HL) is the most common sensory deficit and one of the most common congenital abnormalities. Estimates indicate that among every 1000 newborns screened, 1.1–3.5 will have HL [1, 2]. The etiology of HL is multifactorial, encompassing genetic defects, physical trauma, infections, drug toxicity, noise exposure, and aging, among other factors [3, 4, 5]. Genetic predispositions play a pivotal role in congenital and early‐onset HL, which is characterized by significant genetic and phenotypic heterogeneity, with more than 200 genes identified thus far [6, 7, 8].

Current research and clinical applications in audiology and hearing disorders, such as novel HL gene identification, exploration of auditory mechanisms, research on auditory system development, variant interpretation, genetic diagnosis, and the development of gene therapies, hold the promise of translating individual genomic data into clinically relevant information to aid in disease diagnostics and facilitate personalized therapeutic decision making [9, 10, 11, 12, 13, 14, 15]. These efforts depend critically on the integration of vast data and knowledge accumulated across numerous databases and repositories. This integration includes gene/region‐level annotations detailing genomic features (e.g., the UCSC Genome Browser [16] and ENSEMBL [17]), transcriptional information (e.g., gEAR [18] and Genotype‐Tissue Expression (GTEx) [19]), gene/protein functions (e.g., UniProtKB [20] and Gene Ontology (GO) [21]) and structures (e.g., InterPro [22] and PDB [23]), pathway information (e.g., Reactome [24] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [25]), gene‐gene interactions (e.g., STRING [26] and BioGrid [27]), and gene‐drug interactions (e.g., PharmGKB [28] and DGIdb [29]).
Variant annotations also play a pivotal role in interpreting the effects of molecular processes and disease causation, including population frequencies (e.g., gnomAD [30], ChinaMAP [31], and BBJ [32]), genotype‐phenotype correlations and pathogenicity classifications (e.g., ClinVar [33], HGMD [34], and Deafness Variation Database (DVD) [6]), in silico function predictions (e.g., CADD [35], REVEL [36], and SpliceAI [37]), and automated literature mining (e.g., LitVar [38] and Variant2Literature [39]). Additionally, disease and phenotype information extracted from literature through expert curation includes standardized disease/phenotype descriptions (e.g., Human Phenotype Ontology (HPO) [40] and MedlinePlus [41]), disease‐gene correlation (e.g., OMIM [42] and ClinGen [43]), and animal models (e.g., Mouse Genome Informatics (MGI) [44]). Together, these diverse databases provide essential support for researchers to explore the complexities of genetic and molecular mechanisms underlying audiology and hearing disorders, offering crucial insights into broader biological processes, disease pathogenesis, and potential therapeutic interventions.

However, the lack of a dedicated, comprehensive database for hearing research has significantly hindered the compilation and integration of these resources. Existing hearing‐related databases like DVD and gEAR cater to niche aspects of hearing loss research, with DVD focusing on variant classification within HL genes and gEAR on gene expression in the cochlea. Information is often scattered across specialized catalogs tailored to specific fields, particular model organisms, or specific techniques, resulting in heterogeneous data vocabularies, ontologies, and formats.
This heterogeneity complicates efforts to fully reconcile and integrate the data, posing significant challenges in understanding the landscape of hearing loss and in identifying and prioritizing relevant information, and consequently impeding the extraction of meaningful insights.

To bridge this gap, we established the Genetic Deafness Commons (GDC, http://gdcdb.net/), a standardized database and knowledge base that comprehensively consolidates and integrates genetic and genomic data from both public and in‐house sources. The GDC leverages 51 public databases to provide extensive information on HL genes, variants, and phenotypes. Additionally, it integrates genetic findings from 22 125 HL cases from the Chinese Deafness Genetics Consortium (CDGC), offering real‐world patient cohort data to support variant interpretation and curation. Utilizing the extensive dataset of GDC, this study conducts a thorough analysis of the genetic architecture of HL genes and variants, uncovering multiple new patterns to improve the efficacy of variant pathogenicity interpretation and genetic diagnosis. By integrating public gene expression data from mice and in‐house data from rhesus macaques, this study applied a machine learning method to identify a set of new candidate HL genes, demonstrating the significant value of GDC in deciphering the genetic and molecular foundations of audiology and hearing‐related conditions.

2. Results

2.1. Overview of GDC

The current version of the GDC amalgamates data from 51 public sources and the CDGC project, incorporating 5 983 613 variants across 201 reported HL genes (Figure 1 and Table S1, Supporting Information). GDC encompasses detailed disease information, gene expression patterns, annotations related to gene functions and pathways, protein structures and domains, gene‐gene and gene‐drug interactions, as well as other essential data.
Additionally, GDC introduces multiple specialized annotations to enhance understanding of each variant, including allele frequencies from 14 populations of five large genome sequencing studies, variant pathogenicity classifications from four sources, 20 prediction scores for functional damage, genotype‐phenotype correlations, and variant‐based literature mining results. Comprehensive data sources are detailed in Table S2 (Supporting Information) to support further exploration. Furthermore, GDC offers a user‐friendly query interface to ensure that information is readily accessible in various formats such as graphical representations, tables, and downloadable files, facilitating easy search and navigation for researchers investigating any HL‐related variants and genes.

Figure 1. Construction of the Genetic Deafness Commons (GDC). The data from 51 public databases and the Chinese Deafness Genetics Consortium were collected. A series of data processing steps, including data alignment, gene and variant annotation, format transformation, and lifting over between human reference genomes GRCh37 and GRCh38, were applied to integrate and harmonize data in GDC. All data on GDC can be accessed online or downloaded. All the above logos are from official databases. Illustrations of humans, mice, and rhesus monkeys were created with Biorender.com.

Of the 201 HL genes in the GDC, 139 were associated with non‐syndromic HL (NSHL) and 93 with syndromic HL (SHL), linking to 92 different disorders including Usher syndrome, Alport syndrome, and Waardenburg syndrome (Figure 2A). Notably, 31 genes were implicated in both NSHL and SHL. In terms of inheritance patterns, 95 genes were linked to autosomal dominant (AD), 128 to autosomal recessive (AR), eight to X‐linked (XL), and one to mitochondrial (MT) inheritance patterns. Furthermore, 31 genes exhibit both AD and AR inheritance patterns.
These genes are annotated with 7656 functional terms (with a median of 27 terms per gene) and 8695 phenotypic terms (with a median of 19 terms per gene).

Figure 2. Summary of genetic and genomic information in GDC. A) Summary of the 201 HL genes. The outer circle shows the genomic location of the 201 genes. Genes located on different chromosomes are distinguished by color. Inner circles show inheritance patterns, syndromic or non‐syndromic HL status, variant statistics, GO terms, HPO terms, and gene‐gene interactions, respectively. B) The total number and proportion of variants observed, classified by genomic location. Variants in coding regions were further classified by functional consequence. C) Developmental time points of gene expression data from cochleae of mice and rhesus macaques that were integrated by GDC.

Within the exons, introns, and 1 kb upstream and downstream regions of the 201 HL genes, 5 983 613 genetic variants were collated in the GDC from multiple sources including CDGC (v202407, n = 217 915), gnomAD (v3.1, n = 4 973 216), ChinaMAP (v1, n = 997 781), DVD (v9, n = 2 100 721), ClinVar (20230702, n = 121 439), and HGMD (2023v2, n = 24 470). Of all variants, 4.63% were located in exonic and adjacent (±8 bp) intronic regions. Missense variants constitute the majority of such variants at 61.74%. The next most prevalent types are synonymous variants (27.5%), followed by indels (5.63% frameshift and 1.98% in‐frame), stop gain (2.98%), 5′ UTR (0.38%), canonical splice sites (±2 bp of an intron, 0.12%), splice regions (±3–8 bp of an intron, 0.39%), 3′ UTR (1.92%), up/downstream regions (1.68%), and start/stop loss (<0.2%) (Figure 2B). Most variants were extremely rare, with the GDC dataset containing 1 436 357 (24%) variants not reported in gnomAD.
Additionally, 2 338 525 (39.06%) variants in the GDC dataset are singletons in gnomAD, with doubletons, tripletons, and quadrupletons accounting for 10.72%, 5.15%, and 3.08%, respectively, and 639 113 variants (10.67%) have a minor allele frequency (MAF) < 0.0002 and allele count (AC) > 4.

Furthermore, GDC incorporated extensive gene expression data from both humans and model animals. This included bulk RNA‐seq data of 54 human tissues sourced from the GTEx database, single‐cell RNA‐seq data at 12 developmental stages of the mouse cochlea from five public repositories (GSE172110, GSE182202, GSE181454, GSE202920, and CRA004814) [45, 46, 47, 48, 49], and in‐house bulk RNA‐seq data from three adult rhesus macaque cochleae (Figure 2C). These datasets enable the GDC to provide a detailed and dynamic view of gene expression across different species and developmental stages, significantly enhancing the potential for discoveries in audiology and hearing disorders.

2.2. Variant Pathogenicity Classification and Curation

Across the CDGC, DVD, ClinVar, and HGMD databases, 38.18% (n = 2 284 692) of all GDC variants were classified as pathogenic (P), likely pathogenic (LP), benign (B), likely benign (LB), or variants of uncertain significance (VUS). A total of 35 702 variants were reported as P/LP by at least one source (Figure 3A). Among these variants, 16 244 (45.5%) were consistently classified as P/LP across two or more sources, whereas 11 607 (32.51%) P/LP variants were reported in only one source. Conversely, 7851 (21.99%) variants showed medically significant classification conflicts, shifting between P/LP and B/LB or VUS, indicating the need for further validation using more extensive patient data (Figure 3B). Among the databases, HGMD exhibited the highest number of conflicts between P/LP and B/LB classifications (n = 1040), as shown in Figure 3C and Figure S1 (Supporting Information).
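The three concordance categories reported above can be illustrated with a minimal Python sketch. It assumes each variant carries one label per reporting source and treats any non-P/LP label from another source as a conflict. The function name, the label strings, and this conflict rule are illustrative assumptions and not a description of the GDC curation pipeline.

```python
from collections import Counter

PLP = {"P", "LP"}  # pathogenic / likely pathogenic labels


def concordance(calls):
    """Categorise one variant from its per-source classifications.

    `calls` maps a source name (e.g. "ClinVar") to one of
    "P", "LP", "B", "LB", "VUS". Returns None when no source
    reports the variant as P/LP.
    """
    labels = list(calls.values())
    n_plp = sum(label in PLP for label in labels)
    if n_plp == 0:
        return None
    if n_plp < len(labels):
        # At least one other source says B/LB or VUS (assumed rule).
        return "conflict"
    return "consistent" if n_plp >= 2 else "single-source"


# Hypothetical toy input: variant identifier -> classifications by source.
toy_variants = {
    "var1": {"ClinVar": "P", "HGMD": "LP"},
    "var2": {"DVD": "LP"},
    "var3": {"HGMD": "P", "CDGC": "LB"},
    "var4": {"DVD": "VUS"},
}
summary = Counter(concordance(c) for c in toy_variants.values())
print(summary)
```

Under this rule the three categories partition the set of variants with at least one P/LP report, which matches the way the percentages above sum to 100%.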
Remarkably, 640 variants previously classified as “P/LP” (“DM” or “DM?”) in HGMD were reclassified to B/LB by CDGC. Of these, 467 variants were reclassified because of an MAF > 0.5% for AR or MAF > 0.1% for AD, 134 because of a relatively high MAF combined with benign computational predictions, 35 because they were AD‐associated variants detected in multiple CDGC controls, and 4 based on other combined evidence. This highlights the importance and efficacy of incorporating data from diverse and underrepresented populations in variant interpretation.

Figure 3. Variant pathogenicity classification from varied sources. A) Shared and unique P/LP variants collected from DVD, ClinVar, HGMD, and CDGC datasets. B) Comparative overview of variant classification in GDC and the consistency among the databases. C) P/LP and B/LB conflicts across DVD, ClinVar, HGMD, and CDGC. The bar plot on the left side shows the total number of conflicted variant classifications for each dataset.

2.3. Analyzing Pathogenic Variants for Insights into Gene Function and Tolerance

Protein truncating variants (PTVs), including stop‐gain, frameshift, start loss, stop loss, and canonical splice site changes, are expected to result in a complete loss of function (LoF) of the affected transcripts and are considered potentially deleterious [50]. Within the protein‐coding genes included in the GDC, the count of PTVs ranged from 6 to 1886. A correlation between PTV count and coding region size was observed (R^2 = 0.72, Figure S2 and Table S3, Supporting Information). However, genes such as AIFM1, MAP1B, and USP48 exhibited a significantly lower PTV ratio, suggesting an intolerance to PTVs in these genes. A total of 31 558 PTVs were classified by CDGC, DVD, ClinVar, or HGMD. Notably, the majority of the PTVs classified as VUS (n = 6785) were contributed by DVD (77.47%).
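As a rough illustration of how PTV counts can be related to coding region size, the sketch below counts PTVs per gene and normalizes by coding length. The consequence labels, gene names, and lengths are hypothetical placeholders, and the per-kilobase normalization is only one assumed way to express a "PTV ratio"; the exact definition used for Figure S2 is not given in this section.

```python
# Consequence labels are hypothetical strings chosen for this sketch;
# they mirror the PTV categories listed in the text.
PTV_CONSEQUENCES = {
    "stop_gained",
    "frameshift",
    "start_lost",
    "stop_lost",
    "canonical_splice_site",
}


def ptv_per_kb(variants, coding_length_bp):
    """Return PTVs per kilobase of coding sequence for each gene.

    variants: iterable of (gene, consequence) pairs.
    coding_length_bp: dict mapping gene -> coding region length in bp.
    """
    counts = {}
    for gene, consequence in variants:
        if consequence in PTV_CONSEQUENCES:
            counts[gene] = counts.get(gene, 0) + 1
    return {
        gene: 1000 * counts.get(gene, 0) / length
        for gene, length in coding_length_bp.items()
    }


# Hypothetical toy data, not taken from the GDC.
toy_variants = [
    ("GENE_A", "stop_gained"),
    ("GENE_A", "missense"),
    ("GENE_A", "frameshift"),
    ("GENE_B", "start_lost"),
]
toy_lengths = {"GENE_A": 2000, "GENE_B": 500, "GENE_C": 1500}
density = ptv_per_kb(toy_variants, toy_lengths)
print(density)
```

Genes whose density falls well below the trend set by the other genes would be the candidates for PTV intolerance in this simplified view.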
Analysis of the variant types among P/LP classifications for each gene revealed that PTVs constitute more than half of P/LP variants in 127 genes (67.91%) (Figure 4A). All P/LP variants in the genes EPS8 (n = 13), SYNE4 (n = 12), MPZL2 (n = 11), GRXCR2 (n = 5), BDP1 (n = 3), GAS2 (n = 2), CLDN9 (n = 1), and CD164 (n = 1) were PTVs, indicating that LoF is likely the disease mechanism in these genes. In 52 genes (27.96%), more than half of the P/LP variants were missense (Figure 4A). Specifically, in genes like DIABLO, ELMOD3, IFNLR1, MYO1C, P2RX2, PDE1C, PLS1, POLR1B, S1PR2, SCD5, THOC1, and WBP2, all P/LP variants were missense. In a literature review, a gain of function (GoF) mechanism was reported only for DIABLO [51], suggesting that GoF warrants further exploration as a probable disease mechanism across these genes.

By analyzing the rates of P/LP variants among PTVs for each gene, the distribution of genes across different inheritance patterns is visualized (Figure 4B). Genes with an established LoF mechanism, such as POU4F3, CHD7, MITF, EYA1, and EYA4, were located in a quadrant where both the P/LP ratio of PTVs and the PTV ratio of P/LP variants exceed 50% (Figure S3, Supporting Information). Conversely, genes tolerant to LoF, such as DIAPH3 and COCH, are found in a quadrant where both ratios are below 50%. It is noteworthy that genes with potentially lethal LoF variants, like ACTG1 and AIFM1, also fall into this category.

Figure 4. Genetic landscape of HL‐associated genes. A) Variant types of P/LP variants across 201 HL genes. The bar plot above indicates the number of variants. Abbreviations: AD, Autosomal dominant; AR, Autosomal recessive; XL, X‐linked; PTV, Protein Truncating Variant. B) Pathogenicity classification of PTVs across 201 HL genes. C) Distribution of rare types of pathogenic variants, including deep intronic, synonymous, splice region, start loss, stop loss, UTRs, and intergenic. D) Distribution of PTVs in genes enriched with NMD‐escape PTVs. The orange circle indicates that the variants are within the last exon or the final 50 base pairs of the penultimate exon.

In addition, rare types of pathogenic variants were observed in multiple genes (Figure 4C). For instance, pathogenic splicing variants are prevalent in genes such as COL1A1, FGFR2, COL4A5, and USH2A. Loss‐of‐start and loss‐of‐stop variants are typically found in the genes FGFR3, COL1A1, and NDP. Furthermore, pathogenic variants affecting upstream regions, untranslated regions (UTRs), or downstream non‐coding regions are observed in the genes COL4A6, TOP2B, and HARS1. In certain genes (MARVELD2, MYH9, MYH14, CABP2, CLIC5, HOMER2, SPATA5L1, and WFS1), pathogenic PTVs were abundantly identified within the last exon or the final 50 base pairs of the penultimate exon (Figure 4D; Figure S4, Supporting Information), regions predicted to potentially escape nonsense‐mediated mRNA decay (NMD). The pathogenicity of 3′‐end PTVs in these genes requires careful evaluation, given their potentially significant impact on protein activity and stability.

2.4. Identification of “Hotspots” for Pathogenic Missense Variants

We further evaluated the distribution of missense P/LP variants using Gaussian kernel density estimation, overlapping with protein domains to identify hotspots of disease‐related missense mutations. By integrating various sources, we identified 8787 missense P/LP variants in 201 HL genes (ranging from 1 to 785 per gene).
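A minimal sketch of the density-based hotspot idea follows, written in plain Python. It assumes a fixed kernel bandwidth and calls the hotspot as the contiguous residue range around the density peak that stays above half of the peak value. Both parameters, and the function names, are illustrative assumptions, since the bandwidth and the region-calling rule used for the GDC analysis are not stated in this section.

```python
import math


def gaussian_kde(positions, bandwidth):
    """Return a Gaussian kernel density function over residue positions."""
    n = len(positions)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions
        )

    return density


def densest_window(positions, protein_length, bandwidth=10.0, fraction=0.5):
    """Residue range around the density peak with density >= fraction * peak.

    Returns 1-based (start, end) residue coordinates. The bandwidth and
    the fraction cutoff are assumed values for illustration only.
    """
    density = gaussian_kde(positions, bandwidth)
    values = [density(x) for x in range(1, protein_length + 1)]
    peak = max(range(protein_length), key=values.__getitem__)
    cutoff = fraction * values[peak]
    start = end = peak
    while start > 0 and values[start - 1] >= cutoff:
        start -= 1
    while end < protein_length - 1 and values[end + 1] >= cutoff:
        end += 1
    return start + 1, end + 1


# Hypothetical residue positions of P/LP missense variants in one protein.
toy_positions = [100, 102, 105, 110, 300]
print(densest_window(toy_positions, protein_length=400))
```

The resulting window can then be intersected with annotated protein domains, and the same density can be computed for population variants to contrast the two distributions.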
Significant enrichment of P/LP missense variants was predominantly found in transcription factor genes, including GATA3 (NP_002042: p.264‐342), GRHL2 (NP_079191: p.213‐438), LMX1A (NP_796372: p.196‐252), MITF (NP_000239: p.198‐288), PAX3 (NP_852124: p.34‐159; p.220‐276), POU3F4 (NP_000298: p.186‐340), POU4F3 (NP_002691: p.179‐333), REST (NP_001350382: p.159‐412), SIX1 (NP_005973: p.127‐181), and SOX10 (NP_008872: p.103‐172) (Table S4, Supporting Information). Remarkably, these enriched regions were all DNA‐binding domains (Figure 5A; Figure S5, Supporting Information). The exceptions in transcription factor genes were FOXI1, SIX5, and SNAI2, likely due to the limited number of identified P/LP variants.

We calculated the positive likelihood ratio for a missense variant being P/LP in these hotspot regions. The lower boundary of the 95% confidence interval of the positive likelihood ratio (LR+_LB) ranged from 0.26 to 9.75 (Figure 5B). Compared with the thresholds for pathogenic evidence suggested by Tavtigian et al. [52], hotspots in PAX3, SOX10, and GATA3 met the moderate level (LR+_LB > 4.3) of the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) pathogenic evidence. LMX1A, POU3F4, and MITF fulfilled the supporting level (LR+_LB > 2.08) for pathogenicity according to the same criteria. Additionally, missense variants of KCNQ1 and KCNQ4 were significantly enriched in the ion transport domain (LR+_LB = 2.08) and P‐loop domain (LR+_LB = 9.78), respectively, consistent with previous reports [53]. No significant missense variant enrichment was observed in other genes.

Figure 5. Identification of gene domains for enrichment of pathogenic missense variants. A) Distribution of P/LP missense variants across genes. Red curves are the density for P/LP variants reported in ClinVar, HGMD, DVD, and the CDGC cohort. Blue curves are for variants identified in the gnomAD population. The horizontal bar represents protein domains, with the orange color indicating DNA‐binding domains. Abbreviations: ZnF, Zinc finger; CP2, Grh/CP2 DNA‐binding (DB) domain profile; TAD, Transactivation Domain; bHLH‐LZ, Basic helix‐loop‐helix, Leucine‐zipper; PD, “Paired box” domain; HD, Homeodomain; POU‐S, POU‐specific domain; POU‐H, POU‐homeodomain; SIX1_SD, Transcriptional regulator, SIX1, N‐terminal SD domain; DIM, Dimerization; HMG, High mobility group box; K2, Context‐dependent transactivation (SOX E group conserved) domain; TA, Transactivation. B) Statistical analysis of likelihood ratio (LR) and odds ratio (OR) for a missense variant being P/LP in these domains.

2.5. Discrepancies of HL Phenotype Between Human and Mouse Models

Phenotypic analysis utilizing model organisms, predominantly mice, is crucial for identifying disease genes and elucidating underlying mechanisms. Nevertheless, phenotypes associated with homologous genes often exhibit substantial differences between humans and mice [54, 55, 56]. By reviewing the mouse phenotype database and literature, we categorized mouse phenotypes into three groups: “HL”, “No HL Evidence,” and “No Auditory Testing”. The classification of mouse phenotypes and the related annotations, along with references, can be found in Table S5 (Supporting