[Graphical abstract: fx1.jpg]

Highlights

* A graph-of-graphs approach integrates multi-scale protein information
* Graph neural networks predict both mode of inheritance and functional mechanisms
* Feature attribution reveals biological insights into genetic disease mechanisms
* Models were applied to all autosomal proteins, and predictions are publicly available

molecular interaction; molecular network; neural networks; quantitative genetics

Introduction

Most human genetic diseases result from variants that disrupt protein function through diverse molecular mechanisms, which play a critical role in determining their mode of inheritance (MOI).[26]^1 In autosomal dominant (AD) disorders, a single copy of a mutated gene can result in disease, often through loss of function (LOF) due to haploinsufficiency (HI), where the remaining wild-type allele cannot compensate for the lost function.[27]^2 Dominant disorders can also result from non-LOF mechanisms, such as gain of function (GOF), where the mutant protein acquires a new or altered function, and the dominant-negative (DN) effect, where the mutant protein interferes with the normal function of the wild-type protein.[28]^3 In contrast, autosomal recessive (AR) disorders require variants in both gene copies, predominantly involving LOF mechanisms, such as missense variants that destabilize protein structure or nonsense variants leading to truncated, non-functional proteins.

Previous studies on MOI prediction have introduced computational tools such as DOMINO,[29]^4 which utilizes linear discriminant analysis (LDA) to predict whether a protein is associated with AD disorders by integrating various features such as genomic data, conservation, and protein interactions.
MOI-Pred,[30]^5 on the other hand, focuses on variant-level predictions, specifically targeting missense variants associated with AR diseases. More recent research has aimed at predicting the functional impact of variants in specific genes. LoGoFunc combines gene-, protein-, and variant-level features to predict pathogenic GOF, LOF, and neutral variants.[31]^6 Another study explored the structural effects of variants, finding that non-LOF variants tend to have milder impacts on protein structure.[32]^7 Additionally, a recent study employed three support vector machines (SVMs) to predict protein-coding genes associated with DN, GOF, and HI mechanisms.[33]^8

In this study, we present a comprehensive approach for predicting the MOI for all proteins encoded by autosomal genes, as well as elucidating the functional effect of variants underlying AD genetic disorders ([34]Figure 1). Our framework combines graph neural networks (GNNs)[35]^9 with structural interactomics by creating a graph of graphs,[36]^10 utilizing both a protein-protein interaction (PPI) network and high-resolution protein structures. For MOI prediction, we model proteins as nodes within the PPI network, incorporating topological and protein-level features for classification. For molecular mechanism prediction, we represent each protein as a graph of amino acid residues, leveraging structure-based features to classify the functional effect as HI, GOF, or DN. This integrated approach enables proteome-wide prediction of inheritance patterns and provides mechanistic insights into AD diseases, offering a novel, scalable framework for understanding genetic disorders.

Figure 1. Overview of the study. The mode of inheritance (MOI) is first predicted for all autosomal proteins in the protein-protein interaction network.
Next, AlphaFold-predicted protein structures are used to generate residue-level graphs for each dominant protein, which are then used to predict functional effects. Figure created with [39]https://www.biorender.com.

For the sake of flow and conciseness, we refer to “proteins associated with AD disorders” as AD proteins and “proteins associated with AR disorders” as AR proteins. Similarly, we use DN (GOF/LOF) proteins instead of “proteins associated with DN (GOF/LOF) molecular disease mechanisms.”

Results

Datasets

MOI data

We gathered 4,737 MOI-labeled proteins; among them, 2,494 (53%) were only AR, 1,420 (30%) were only AD, and 808 (17%) were both AD and AR ([40]Figure 2, left).

Figure 2. Summary of MOI data. Left panel shows the number of proteins with annotated MOI. Right panel displays the number of proteins with annotations regarding their molecular mechanism.

Functional effect data

We collected 1,276 proteins with annotated functional effect; among them, 250 (20%) were only DN, 376 (29%) were only HI, 251 (20%) were only GOF, 114 (9%) were both DN and HI, 115 (9%) were both DN and GOF, 92 (7%) were both HI and GOF, and 78 (6%) were annotated with all three mechanisms (DN, HI, and GOF) ([43]Figure 2, right).

PPI construction and annotation

We constructed a comprehensive PPI network comprising 17,248 nodes and 375,494 edges by integrating interactions from STRINGdb (search tool for the retrieval of interacting proteins database),[44]^11 BioGRID (biological general repository for interaction datasets),[45]^12 the Human Reference Interactome (HuRI),[46]^13 and Menche et al.[47]^14 To characterize proteins, we annotated them with 78 selected features covering structural, functional, evolutionary, and regulatory properties ([48]Table S1).
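As a rough illustration of how interactions from several sources can be merged into one undirected network, the sketch below unions edge lists while dropping duplicate pairs and self-loops. The protein identifiers, the toy edges, the function name `merge_ppi_sources`, and the deduplication rule are all assumptions made for illustration; the source does not describe the exact integration procedure.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def merge_ppi_sources(sources: Dict[str, Iterable[Edge]]) -> Tuple[Set[str], Set[Edge]]:
    """Union undirected edges from several sources (hypothetical helper)."""
    nodes: Set[str] = set()
    edges: Set[Edge] = set()
    for edge_list in sources.values():
        for a, b in edge_list:
            if a == b:
                continue  # ASSUMPTION: self-interactions are discarded
            # Store each undirected edge in a canonical (sorted) orientation
            edge = (a, b) if a < b else (b, a)
            edges.add(edge)
            nodes.update(edge)
    return nodes, edges


# Toy edge lists with placeholder protein IDs (not real database content)
toy_sources = {
    "STRINGdb": [("P1", "P2"), ("P3", "P4")],
    "BioGRID": [("P2", "P1"), ("P1", "P1")],
    "HuRI": [("P3", "P4"), ("P5", "P6")],
}
nodes, edges = merge_ppi_sources(toy_sources)
print(len(nodes), len(edges))
```

Storing each pair in sorted order means that the same interaction reported by two databases in opposite orientations is counted once.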
Protein graph construction and annotation

We obtained predicted protein structures from the AlphaFold database[49]^15 and constructed residue-level graphs using Graphein.[50]^16 In these graphs, nodes represent amino acids, while edges capture peptide bonds, hydrogen bonds, disulfide bonds, ionic interactions, and other structural contacts, including long-range interactions. We annotated amino acid residues with 73 selected features reflecting structural, sequence-based, biochemical, and evolutionary characteristics ([51]Table S2).

Model development

Study design

We formulated MOI prediction as a node classification task within the PPI network and functional effect prediction as a graph classification task. Both models employed a multi-label classification approach, allowing each input to have multiple labels. We evaluated various GNN architectures, including graph convolutional networks (GCNs),[52]^17 graph attention networks (GATs),[53]^18 and graph isomorphism networks (GINs).[54]^19

Data splitting

To construct training, validation, and test sets, we clustered human protein sequences using MMseqs2[55]^20 with stringent thresholds of 20% sequence identity and 20% alignment coverage. This conservative cutoff minimizes sequence similarity between splits, thereby reducing the risk of information leakage and encouraging the model to generalize. Protein clusters were then assigned to the training (80%), validation (10%), or test (10%) set.

Hyperparameter tuning and model training

All models used a single hidden layer, with the output layer containing two units for MOI prediction (AD and AR) and three units for functional effect prediction (DN, HI, and GOF). To tune the hyperparameters, we evaluated 25 configurations on the validation set by varying the hidden layer size across five values (128, 64, 32, 16, and 8) and the learning rate across five values ranging from $10^{-2}$ to $5 \times 10^{-4}$.
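The 25-configuration search can be sketched as a Cartesian product of the two hyperparameter lists. The hidden layer sizes come from the text; the three intermediate learning rates are assumptions, since only the endpoints of the range are stated, and `dummy_validation_score` is a hypothetical stand-in for training a model and scoring it on the validation set.

```python
from itertools import product
from typing import Callable, Dict, List

HIDDEN_SIZES = [128, 64, 32, 16, 8]  # stated in the text
# ASSUMPTION: only the endpoints (1e-2 and 5e-4) are given in the text;
# the three intermediate values are illustrative.
LEARNING_RATES = [1e-2, 5e-3, 2e-3, 1e-3, 5e-4]

Config = Dict[str, float]


def build_grid(hidden_sizes: List[int], learning_rates: List[float]) -> List[Config]:
    """Enumerate every (hidden size, learning rate) combination."""
    return [
        {"hidden_size": h, "learning_rate": lr}
        for h, lr in product(hidden_sizes, learning_rates)
    ]


def select_best(grid: List[Config], validation_score: Callable[[Config], float]) -> Config:
    """Return the configuration with the highest validation score."""
    return max(grid, key=validation_score)


def dummy_validation_score(config: Config) -> float:
    # Hypothetical placeholder: a real run would train a GNN with this
    # configuration and return a validation metric instead.
    return -abs(config["hidden_size"] - 32) - abs(config["learning_rate"] - 1e-3)


grid = build_grid(HIDDEN_SIZES, LEARNING_RATES)
best = select_best(grid, dummy_validation_score)
print(len(grid), best)
```

Passing the scoring function as an argument keeps the grid logic independent of the model architecture, so the same loop could serve GCN, GAT, and GIN variants.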
The results of hyperparameter tuning for MOI and functional effect prediction are provided in [56]Tables S3 and [57]S4. Using the selected hyperparameters, we trained each model with binary cross-entropy loss for up to 100 epochs, applying early stopping based on validation loss to prevent overfitting.

Model performance evaluation

MOI models

We evaluated all trained models on the unseen test set ([58]Table 1). The GCN model achieved the highest precision score, while the GAT model had the best recall and $F_1$ score. Due to the class imbalance in the MOI dataset, we prioritized maximizing the $F_1$ score and selected the GAT model. We also assessed the performance of DOMINO[59]^4 as outlined in the [60]STAR Methods section and found that the GCN and GAT models outperformed it on all three metrics, whereas the GIN model exceeded it only in precision.

Table 1. MOI prediction performance on the test set

| Metric | GCN | GAT | GIN | LDA[61]^4 |
|---|---|---|---|---|
| F1 | 0.745 | 0.750 | 0.671 | 0.685 |
| Precision | 0.776 | 0.770 | 0.764 | 0.721 |
| Recall | 0.725 | 0.731 | 0.621 | 0.654 |

GCN, graph convolutional network; GAT, graph attention network; GIN, graph isomorphism network; LDA, linear discriminant analysis.

Functional effect models

[63]Table 2 shows the performance of various models on the functional effect test set, with the GCN model achieving the highest $F_1$ score. We also evaluated the SVM models from Badonyi and Marsh[64]^8 as described in the [65]STAR Methods section. Based on the overall performance, we selected the GCN model for functional effect prediction.

Table 2. Functional effect prediction performance on the test set

| Metric | GCN | GAT | GIN | SVM[66]^8 |
|---|---|---|---|---|
| F1 | 0.627 | 0.590 | 0.600 | 0.593 |
| Precision | 0.605 | 0.517 | 0.549 | 0.669 |
| Recall | 0.659 | 0.712 | 0.676 | 0.535 |

GCN, graph convolutional network; GAT, graph attention network; GIN, graph isomorphism network; SVM, support vector machine.
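For readers unfamiliar with multi-label evaluation, the sketch below computes precision, recall, and F1 per label and then averages them across labels (macro averaging). The averaging scheme is an assumption, since the text does not state how the reported scores were aggregated, and the toy labels are invented for illustration.

```python
from typing import Dict, List, Sequence


def macro_prf(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Macro-averaged precision, recall, and F1 for binary multi-label data."""
    n_labels = len(y_true[0])
    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    for j in range(n_labels):
        # Confusion counts for label j across all samples
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "precision": sum(precisions) / n_labels,
        "recall": sum(recalls) / n_labels,
        "f1": sum(f1s) / n_labels,
    }


# Toy example: columns are (AD, AR); a protein may carry both labels
y_true = [[1, 0], [0, 1], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [1, 0], [1, 1]]
scores = macro_prf(y_true, y_pred)
print(scores)
```

Under macro averaging, the averaged F1 need not equal the harmonic mean of the averaged precision and recall, which is worth keeping in mind when reading tables that report all three.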
Model interpretation

MOI feature attribution

Using the GAT model, we calculated feature attribution separately for correctly predicted AD or AR proteins in the test set. We observed that the most important predictors for AD prediction are features related to constraint and conservation ([68]Figure 3, left). The top feature was UNEECON (unified inference of variant effects and gene constraints), which measures evolutionary pressure.[69]^21 Using the labeled data, we observed that AD proteins have higher UNEECON values compared to AR proteins ([70]Figure 3, right).

Figure 3. GAT model interpretation for AD proteins. Left panel shows the relative feature importance values for AD prediction. Right panel displays the distribution of the top feature (UNEECON score) for AD and AR proteins.

For AR prediction, the most important feature was pLI (probability of loss-of-function intolerance)[73]^22 ([74]Figure 4, left). Using the ground truth dataset, we observed that AR proteins have lower pLI values compared to AD proteins ([75]Figure 4, right).

Figure 4. GAT model interpretation for AR proteins. Left panel shows the relative feature importance values for AR prediction. Right panel displays the distribution of the top feature (pLI) for AD and AR proteins.

Functional effect feature attribution

Using the GCN model, we measured feature attribution for correctly predicted DN, HI, and GOF proteins. Because features are at the residue level and predictions are at the protein level, we cannot draw direct conclusions from these measurements, yet they can help to understand the associations. For DN proteins, the most important feature was the RNA-binding score based on DRNApred[78]^23 ([79]Figure 5, left). Using the labeled data, we observed that residues in DN proteins have higher RNA-binding scores compared to HI and GOF proteins ([80]Figure S1).

Figure 5.
GCN model interpretation. Relative feature attribution values for DN (left), HI (middle), and GOF (right) predictions.

For HI proteins, as shown in [83]Figure 5 (middle), the topological domain is the strongest predictor. This feature was derived from UniProt.[84]^24 We observed that HI proteins have a lower fraction of topological domains compared to DN and GOF proteins ([85]Figure S2). Feature attribution analysis for GOF proteins showed that the top feature is the helix structure ([86]Figure 5, right), derived from UniProt.[87]^24 The distribution of helical fractions indicates that GOF proteins have a relatively higher fraction of helical structures compared to HI and DN proteins ([88]Figure S3).

Proteome-wide inference

MOI prediction for all autosomal proteins

Of the 17,248 nodes in the PPI network, 16,477 (96%) were autosomal, and we used the GAT model to predict the most likely MOI for all of them. A total of 8,869 (54%) were predicted to be AR, 6,277 (38%) were predicted to be AD, and 1,206 (7%) were predicted to be ADAR (autosomal dominant and autosomal recessive) ([89]Figure S4). As expected, we observed a strong negative correlation between the probability of being AD and AR (Pearson correlation coefficient = −0.95) ([90]Figure S5). Finally, we performed pathway enrichment analyses for AD and AR proteins separately. AD proteins were significantly enriched in pathways associated with gene regulation ([91]Figure 6, left), while AR proteins were significantly overrepresented in mitochondrial pathways ([92]Figure 6, middle). Using the ground truth dataset, we observed that AR proteins are more likely to be localized inside mitochondria compared to AD proteins ($\mathrm{OR} = 3.13$, $\mathrm{CI} = [2.47, 3.97]$) ([93]Figure 6, right).

Figure 6. Pathway analysis for AD and AR proteins. Left and middle panels show top significantly enriched pathways for AD and AR proteins, respectively.
Right panel shows the number of proteins associated with subcellular localization inside or outside mitochondria. The odds ratio was calculated as $(\mathrm{AR}_{\mathrm{inside}} / \mathrm{AR}_{\mathrm{outside}}) / (\mathrm{AD}_{\mathrm{inside}} / \mathrm{AD}_{\mathrm{outside}})$. The p value was calculated using Fisher’s exact test.

Functional effect prediction for all AD-predicted proteins

Based on the proteome-wide MOI predictions, we identified 7,483 AD or ADAR proteins and predicted their functional effect using the GCN model. Among them, 2,043 (28%) were classified as only DN, 1,097 (15%) as only HI, and 415 (6%) as only GOF. Additionally, 1,843 (26%) were both DN and HI, 1,569 (22%) were both DN and GOF, 181 (3%) were both HI and GOF, and 35 (1%) were classified as DN, HI, and GOF ([96]Figure S6). We also provide the counts based on AD-only and ADAR-only proteins in [97]Figures S7 and [98]S8, respectively. Pathway enrichment analysis revealed that DN proteins are enriched in pathways associated with filament organization ([99]Figure 7, left), HI proteins are overrepresented in pathways related to transcription regulation ([100]Figure 7, middle), and GOF proteins are enriched in pathways related to ion transport across membranes ([101]Figure 7, right).

Figure 7. Pathway analysis for DN, HI, and GOF proteins. The panels show the top significantly enriched pathways for proteins associated with DN (left), HI (middle), and GOF (right) mechanisms.

Discussion

In this work, we introduce a novel framework that integrates GNNs with structural interactomics to predict both the MOI and the functional effects of mutated proteins in genetic disorders. By leveraging a PPI network and high-resolution protein structures, we offer a graph-of-graphs approach that addresses two critical aspects of genetic disease prediction. This allows us to not only classify proteins as AD or AR but also predict whether AD diseases manifest through HI, GOF, or DN mechanisms.
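The two-stage logic summarized above (MOI first, then functional effect only for AD or ADAR proteins) can be sketched as a simple decoding step over per-label probabilities. The 0.5 decision threshold, the helper names, and the toy probabilities are assumptions for illustration; the source does not specify how model outputs were converted into labels.

```python
from typing import Dict, List

THRESHOLD = 0.5  # ASSUMPTION: the decision threshold is not given in the text


def decode_labels(probs: Dict[str, float], threshold: float = THRESHOLD) -> List[str]:
    """Keep every label whose predicted probability reaches the threshold."""
    return sorted(label for label, p in probs.items() if p >= threshold)


def moi_class(probs: Dict[str, float]) -> str:
    """Collapse multi-label MOI output into AD, AR, ADAR, or 'none'."""
    labels = decode_labels(probs)
    if labels == ["AD", "AR"]:
        return "ADAR"
    return labels[0] if labels else "none"


def two_stage(
    moi_probs: Dict[str, Dict[str, float]],
    effect_probs: Dict[str, Dict[str, float]],
) -> Dict[str, Dict[str, object]]:
    """Predict functional effects only for proteins called AD or ADAR."""
    result: Dict[str, Dict[str, object]] = {}
    for protein, probs in moi_probs.items():
        moi = moi_class(probs)
        effects = decode_labels(effect_probs[protein]) if moi in ("AD", "ADAR") else []
        result[protein] = {"moi": moi, "effects": effects}
    return result


# Toy probabilities with placeholder protein IDs (not model output)
moi_probs = {
    "P1": {"AD": 0.9, "AR": 0.1},
    "P2": {"AD": 0.2, "AR": 0.8},
    "P3": {"AD": 0.6, "AR": 0.7},
}
effect_probs = {
    "P1": {"DN": 0.7, "HI": 0.6, "GOF": 0.2},
    "P3": {"DN": 0.1, "HI": 0.3, "GOF": 0.9},
}
predictions = two_stage(moi_probs, effect_probs)
print(predictions)
```

Because each label is thresholded independently, a protein can receive several mechanisms at once or none at all, which mirrors the multi-label formulation described in the study design.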
Our framework demonstrated good performance in predicting MOI, with the GAT model achieving the best $F_1$ score for identifying AD and AR proteins. In terms of functional effects, the GCN model effectively classified HI, GOF, and DN proteins based on structural features.

The most important feature in predicting AD proteins is the evolutionary pressure score (UNEECON).[104]^21 This aligns with previous studies showing that AD proteins experience stronger negative selection than AR proteins.[105]^25^,[106]^26 Additionally, we observed a strong enrichment of AD proteins in pathways related to gene expression regulation. Prior research has shown that transcription factors (TFs) are often dosage sensitive, particularly haploinsufficient, leading to dominant disease phenotypes.[107]^27^,[108]^28 This is consistent with the fact that many human birth defects and neurodevelopmental disorders are caused by mutations in a single copy of TFs and chromatin regulator genes.[109]^29

For AR proteins, we found that the pLI (probability of loss-of-function intolerance) index is the most important feature. The pLI index estimates the likelihood that knocking out one copy of a gene will result in a phenotype.[110]^22 Low-pLI genes typically exhibit functional redundancy or possess sufficient reserve capacity, allowing heterozygous carriers to remain asymptomatic.[111]^30 We also observed that AR proteins are enriched in mitochondrial pathways, consistent with previous findings that the vast majority of nuclear-encoded mitochondrial disease genes follow a recessive inheritance pattern.[112]^31 This bias toward recessive inheritance likely arises because defects in energy metabolism generally become pathogenic only when both alleles are disrupted.
As long as one allele remains functional, mitochondrial pathways can sustain baseline energy production, preventing deleterious consequences.[113]^32^,[114]^33 Feature attribution analysis revealed that DN proteins are strongly associated with high RNA-binding scores, consistent with the previous observation that DN mutations are enriched in nucleic acid-binding pathways.[115]^8 One possible explanation is that DN mutations often occur at critical interaction sites, such as DNA/RNA-binding interfaces, where they allow the mutant protein to retain its ability to bind partners but disrupt the overall function of the interacting complex. Additionally, we observed an enrichment of DN proteins in pathways related to filament organization. This is likely due to the inherent susceptibility of filamentous and polymeric assemblies to “poisoning” by mutant subunits, which can incorporate into multimers and destabilize the entire structure. A well-documented example is keratin-related disorders, where keratins (type I and II intermediate filament proteins) form an essential cytoskeletal network in epithelial cells. Mutations in keratin genes lead to cell fragility and are inherited in an AD manner, with the mutations exerting their effect through a DN mechanism.[116]^34 A depletion in topological domain emerged as the most important feature in predicting HI proteins. According to UniProt, the topological domain annotation defines the subcellular compartment in which each non-membrane region of a membrane-spanning protein is located. Our findings indicate that DN and GOF proteins have a higher fraction of topological domain annotations compared to HI proteins, suggesting that HI genes encode fewer membrane-spanning proteins than DN and GOF genes. This distinction may reflect fundamental differences in the functional roles of these proteins and their sensitivity to dosage effects. 
Furthermore, HI proteins are significantly enriched in pathways related to transcriptional regulation, consistent with previous findings that TFs are frequently dosage sensitive and particularly prone to HI.[117]^27^,[118]^28

Finally, the most important feature for predicting GOF proteins is helix structure. This finding is consistent with previous reports showing that GOF variants are significantly more likely to occur in alpha helices.[119]^6 We also observed that GOF genes are enriched in pathways related to ion transport across membranes, further supporting the structural-functional link between helices and membrane proteins. A notable example is epilepsy-associated genes: approximately 25% of them encode ion channels, many of which cause epilepsy through a GOF mechanism.[120]^35 This association is biologically plausible, as many membrane proteins have a core architecture of transmembrane helices, which are critical for gating and transport functions.[121]^36^,[122]^37

Moving forward, there are several avenues for expanding this work. Incorporating tissue-specific PPI networks and expression data could improve the precision of our predictions, especially for proteins with context-dependent functions.[123]^38 Additionally, expanding the model to account for more complex inheritance patterns, such as polygenic traits and epistasis, could provide a more comprehensive understanding of genetic disease.[124]^39 Moreover, improving the interpretability of models in biological contexts remains essential to derive more actionable insights from the predictions.[125]^40 Finally, integrating these predictions with other computational tools and databases could further enhance our understanding of genetic diseases by providing a more holistic view of their underlying mechanisms.[126]^41^,[127]^42^,[128]^43

Limitations of the study

Although our approach provides a comprehensive view of inheritance patterns and functional effects, it has several limitations.
First, the availability of high-quality structural data for all human proteins remains limited, potentially affecting prediction accuracy.[129]^44 To ensure uniform coverage, we relied on AlphaFold-predicted structures, which offer high accuracy for well-folded domains but have notable drawbacks. Unlike experimentally resolved structures, AlphaFold does not capture conformational flexibility, ligand interactions, or post-translational modifications, all of which are critical for functional interpretation. Additionally, it is less reliable for intrinsically disordered regions and dynamic protein states, where structural plasticity plays a key role.[130]^45 Beyond structural considerations, our reliance on existing PPI network data introduces potential biases, as interaction coverage varies across tissues and biological contexts.[131]^38 Furthermore, class imbalance in labeled training data may affect model performance, particularly for underrepresented functional categories. Finally, while our method effectively predicts the functional effects of AD proteins, it does not extend to other inheritance patterns or interactions influenced by polygenic or epistatic effects.[132]^46

Resource availability

Lead contact

Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Jacques Fellay (jacques.fellay@epfl.ch).

Materials availability

This study did not generate new materials.

Data and code availability

* Data: [133]https://zenodo.org/records/13969533.
* Code: [134]https://github.com/AliSaadatV/Structural-Interactomics.
* Other items: any additional information required to reanalyze the data reported in this article is available from the [135]lead contact upon request.

Acknowledgments