Graphical abstract

   [Graphical abstract image: fx1.jpg]

Highlights

     * A graph-of-graphs approach integrates multi-scale protein
       information
     * Graph neural networks predict both mode of inheritance and
       functional mechanisms
     * Feature attribution reveals biological insights into genetic
       disease mechanisms
     * Models were applied to all autosomal proteins, and predictions are
       publicly available
     __________________________________________________________________

   molecular interaction; molecular network; neural networks; quantitative
   genetics

Introduction

   Most human genetic diseases result from variants that disrupt protein
   function through diverse molecular mechanisms, which play a critical
   role in determining their mode of inheritance (MOI).[26]^1 In autosomal
   dominant (AD) disorders, a single copy of a mutated gene can result in
   disease, often through loss of function (LOF) due to haploinsufficiency
   (HI), where the remaining wild-type allele cannot compensate for the
   lost function.[27]^2 Dominant disorders can also result from non-LOF
   mechanisms, such as gain of function (GOF), where the mutant protein
   acquires a new or altered function, and the dominant-negative (DN)
   effect, where the mutant protein interferes with the normal function of
   the wild-type protein.[28]^3 In contrast, autosomal recessive (AR)
   disorders require variants in both gene copies, predominantly involving
   LOF mechanisms, such as missense variants that destabilize protein
   structure or nonsense variants leading to truncated, non-functional
   proteins.

   Previous studies on MOI prediction have introduced computational tools
   such as DOMINO,[29]^4 which utilizes linear discriminant analysis (LDA)
   to predict whether a protein is associated with AD disorders by
   integrating various features such as genomic data, conservation, and
   protein interactions. MOI-Pred,[30]^5 on the other hand, focuses on
   variant-level predictions, specifically targeting missense variants
   associated with AR diseases.

   More recent research has aimed at predicting the functional impact of
   variants in specific genes. LoGoFunc combines gene-, protein-, and
   variant-level features to predict pathogenic GOF, LOF, and neutral
   variants.[31]^6 Another study explored the structural effects of
   variants, finding that non-LOF variants tend to have milder impacts on
   protein structure.[32]^7 Additionally, a recent study employed three
   support vector machines (SVMs) to predict protein coding genes
   associated with DN, GOF, and HI mechanisms.[33]^8

   In this study, we present a comprehensive approach for predicting the
   MOI for all proteins encoded by autosomal genes, as well as elucidating
   the functional effect of variants underlying AD genetic disorders
   ([34]Figure 1). Our framework combines graph neural networks
   (GNNs)[35]^9 with structural interactomics by creating a graph of
   graphs,[36]^10 utilizing both a protein-protein interaction (PPI)
   network and high-resolution protein structures. For MOI prediction, we model
   proteins as nodes within the PPI network, incorporating topological and
   protein-level features for classification. For molecular mechanism
   prediction, we represent each protein as a graph of amino acid
   residues, leveraging structure-based features to classify the
   functional effect as HI, GOF, or DN. This integrated approach enables
   proteome-wide prediction of inheritance patterns and provides
   mechanistic insights into AD diseases, offering a novel, scalable
   framework for understanding genetic disorders.

Figure 1.

   Overview of the study

   The mode of inheritance (MOI) is first predicted for all autosomal
   proteins in the protein-protein interaction network. Next,
   AlphaFold-predicted protein structures are used to generate
   residue-level graphs for each dominant protein, which are then used to
   predict functional effects. Figure created with
   [39]https://www.biorender.com.

   For the sake of flow and conciseness, we refer to “proteins associated
   with AD disorders” as AD proteins and “proteins associated with AR
   disorders” as AR proteins. Similarly, we use DN (GOF/LOF) proteins
   instead of “proteins associated with DN (GOF/LOF) molecular disease
   mechanisms.”

Results

Datasets

MOI data

   We gathered 4,737 MOI-labeled proteins; among them, 2,494 (53%) were
   only AR, 1,420 (30%) were only AD, and 808 (17%) were both AD and AR
   ([40]Figure 2, left).

Figure 2.

   Summary of MOI data

   Left panel shows the number of proteins with annotated MOI. Right panel
   displays the number of proteins with annotations regarding their
   molecular mechanism.

Functional effect data

   We collected 1,276 proteins with annotated functional effects; among
   them, 250 (20%) were only DN, 376 (29%) were only HI, 251 (20%) were
   only GOF, 114 (9%) were both DN and HI, 115 (9%) were both DN and GOF,
   92 (7%) were both HI and GOF, and 78 (6%) were annotated with all three
   mechanisms (DN, HI, and GOF) ([43]Figure 2, right).

PPI construction and annotation

   We constructed a comprehensive PPI network comprising 17,248 nodes and
   375,494 edges by integrating interactions from STRINGdb (search tool
   for the retrieval of interacting proteins database),[44]^11 BioGRID
   (biological general repository for interaction datasets),[45]^12 the
   Human Reference Interactome (HuRI),[46]^13 and Menche et al.[47]^14 To
   characterize proteins, we annotated them with 78 selected features
   covering structural, functional, evolutionary, and regulatory
   properties ([48]Table S1).
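
   Integrating several interaction sources amounts to taking the union of
   their edge lists while treating each interaction as undirected. The
   following is a minimal sketch of that idea, not the authors' pipeline:
   the function name merge_ppi_sources, the toy identifiers, and the choice
   to drop self-interactions are all assumptions made for illustration.

```python
from itertools import chain


def merge_ppi_sources(*edge_lists):
    """Merge edge lists into one undirected, deduplicated network."""
    edges = set()
    for a, b in chain.from_iterable(edge_lists):
        if a == b:
            continue  # assumption: self-interactions are dropped
        # frozenset makes (A, B) and (B, A) the same edge
        edges.add(frozenset((a, b)))
    nodes = set(chain.from_iterable(edges))
    return nodes, edges


# Toy inputs; the protein identifiers are placeholders, not real data.
source_a = [("P1", "P2"), ("P2", "P3")]
source_b = [("P2", "P1"), ("P3", "P4"), ("P4", "P4")]
nodes, edges = merge_ppi_sources(source_a, source_b)
print(len(nodes), len(edges))  # 4 3
```

   In practice, identifiers from different databases would first need to be
   mapped to a common namespace before the union is taken.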

Protein graph construction and annotation

   We obtained predicted protein structures from the AlphaFold
   database[49]^15 and constructed residue-level graphs using
   Graphein.[50]^16 In these graphs, nodes represent amino acids, while
   edges capture peptide bonds, hydrogen bonds, disulfide bonds, ionic
   interactions, and other structural contacts, including long-range
   interactions. We annotated amino acid residues with 73 selected
   features reflecting structural, sequence-based, biochemical, and
   evolutionary characteristics ([51]Table S2).
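
   To make the idea of a residue-level graph concrete, the sketch below
   builds one from alpha-carbon coordinates using only the standard
   library. It does not use or reproduce the Graphein API: the function
   name residue_graph, the two edge types, and the 8 Å distance cutoff are
   illustrative assumptions, and the real graphs contain more edge types
   (hydrogen bonds, disulfide bonds, ionic interactions) than shown here.

```python
import math


def residue_graph(ca_coords, contact_cutoff=8.0):
    """Return edges as {(i, j): kind} for residues indexed along the chain."""
    edges = {}
    n = len(ca_coords)
    # Consecutive residues are linked by a peptide bond.
    for i in range(n - 1):
        edges[(i, i + 1)] = "peptide_bond"
    # Non-adjacent residues are linked if their C-alpha atoms are close.
    for i in range(n):
        for j in range(i + 2, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= contact_cutoff:
                edges[(i, j)] = "spatial_contact"
    return edges


# Hypothetical C-alpha coordinates (in angstroms) for a five-residue chain.
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (11.4, 0.0, 0.0),
    (7.6, 5.0, 0.0),
]
residue_edges = residue_graph(coords)
print(len(residue_edges))  # 8
```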

Model development

Study design

   We formulated MOI prediction as a node classification task within the
   PPI network and functional effect prediction as a graph classification
   task. Both models employed a multi-label classification approach,
   allowing each input to have multiple labels. We evaluated various GNN
   architectures, including graph convolutional networks (GCNs),[52]^17
   graph attention networks (GATs),[53]^18 and graph isomorphism networks
   (GINs).[54]^19
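
   The difference between the two tasks can be illustrated with a single
   GCN propagation step written in plain Python. Node classification reads
   the per-node embeddings directly, whereas graph classification first
   pools them into one vector. This is a sketch under assumptions: the
   symmetric normalization follows the standard GCN formulation, the
   identity weight matrix is chosen only for readability, and mean pooling
   is an assumed readout rather than the one confirmed for this study.

```python
import math


def gcn_layer(adj, feats, weight):
    """One GCN step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    n_in, n_out = len(weight), len(weight[0])
    # Add self-loops, then normalize symmetrically by node degree.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbor features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n))
            for f in range(n_in)] for i in range(n)]
    return [[max(0.0, sum(agg[i][f] * weight[f][h] for f in range(n_in)))
             for h in range(n_out)] for i in range(n)]


def mean_pool(node_embeddings):
    """Graph-level readout: average the node embeddings."""
    n = len(node_embeddings)
    return [sum(col) / n for col in zip(*node_embeddings)]


adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]       # path graph 0-1-2
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # toy node features
weight = [[1.0, 0.0], [0.0, 1.0]]             # identity, for readability

node_out = gcn_layer(adj, feats, weight)  # used for node classification
graph_out = mean_pool(node_out)           # used for graph classification
```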

Data splitting

   To construct training, validation, and test sets, we clustered human
   protein sequences using MMseqs2[55]^20 with stringent thresholds of 20%
   sequence identity and 20% alignment coverage. This conservative cutoff
   minimizes sequence similarity between splits, thereby reducing the risk
   of information leakage and encouraging the model to generalize. Protein
   clusters were then assigned to the training (80%), validation (10%), or
   test (10%) set.
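
   The key property of this split is that a whole cluster lands in exactly
   one set, so similar sequences never straddle training and evaluation
   data. A minimal sketch of such an assignment is given below. The
   function name split_clusters, the fixed seed, and the interpretation of
   the 80/10/10 proportions as fractions of clusters (rather than of
   proteins) are assumptions; the cluster mapping itself would come from
   the MMseqs2 output.

```python
import random


def split_clusters(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/val/test so no cluster spans two sets."""
    clusters = sorted(set(cluster_of.values()))
    random.Random(seed).shuffle(clusters)
    n_train = int(fractions[0] * len(clusters))
    n_val = int(fractions[1] * len(clusters))
    split_of_cluster = {}
    for idx, cluster in enumerate(clusters):
        if idx < n_train:
            split_of_cluster[cluster] = "train"
        elif idx < n_train + n_val:
            split_of_cluster[cluster] = "val"
        else:
            split_of_cluster[cluster] = "test"
    # Every protein inherits the split of its cluster.
    return {protein: split_of_cluster[c] for protein, c in cluster_of.items()}


# Hypothetical mapping protein -> cluster: 10 clusters of 2 proteins each.
cluster_of = {f"P{i}": f"C{i // 2}" for i in range(20)}
assignment = split_clusters(cluster_of)
```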

Hyperparameter tuning and model training

   All models used a single hidden layer, with the output layer containing
   two units for MOI prediction (AD and AR) and three units for functional
   effect prediction (DN, HI, and GOF). To tune the hyperparameters, we
   evaluated 25 configurations on the validation set by varying the hidden
   layer size across five values (128, 64, 32, 16, and 8) and the learning
   rate across five values ranging from $10^{-2}$ to $5 \times 10^{-4}$.
   The results of hyperparameter tuning for MOI and functional effect
   prediction are provided in [56]Tables S3 and [57]S4. Using the selected
   hyperparameters, we trained each model with binary cross-entropy loss
   for up to 100 epochs, applying early stopping based on validation loss
   to prevent overfitting.
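
   The tuning and training procedure described above can be sketched in a
   few lines. Only the hidden layer sizes and the two learning rate
   endpoints are taken from the text; the three intermediate learning
   rates, the patience value, and all function names are placeholders
   chosen for illustration.

```python
import math
from itertools import product

HIDDEN_SIZES = [128, 64, 32, 16, 8]
# Only the endpoints (1e-2 and 5e-4) come from the text; the three
# intermediate values are assumed placeholders.
LEARNING_RATES = [1e-2, 5e-3, 2e-3, 1e-3, 5e-4]
GRID = list(product(HIDDEN_SIZES, LEARNING_RATES))  # 25 configurations


def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over all labels of all samples."""
    total, count = 0.0, 0
    for p_row, t_row in zip(probs, targets):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def early_stopping_epoch(val_losses, patience=10):
    """Return the best epoch, stopping after `patience` epochs without gain."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


print(len(GRID))  # 25
```

   Because each output unit is scored independently, the same loss applies
   to the two-unit MOI head and the three-unit functional effect head.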

Model performance evaluation

MOI models

   We evaluated all trained models on the unseen test set ([58]Table 1).
   The GCN model achieved the highest precision score, while the GAT model
   had the best recall and $F_1$ score. Due to the class imbalance in the
   MOI dataset, we prioritized maximizing the $F_1$ score and selected the
   GAT model. We also assessed the performance of DOMINO[59]^4 as outlined
   in the [60]STAR Methods section and found that the GCN and GAT models
   outperformed it on all three metrics, whereas the GIN model exceeded it
   only in precision.

Table 1.

   MOI prediction performance on the test set
   Metric     GCN    GAT    GIN    LDA[61]^4
   F1         0.745  0.750  0.671  0.685
   Precision  0.776  0.770  0.764  0.721
   Recall     0.725  0.731  0.621  0.654

   GCN, graph convolutional network; GAT, graph attention network; GIN,
   graph isomorphism network; LDA, linear discriminant analysis.
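
   For a multi-label task, precision, recall, and F1 are computed per
   label and then combined. The averaging scheme is not stated in this
   section, so the sketch below assumes macro-averaging (an unweighted
   mean over labels); the function names and toy labels are illustrative.
   Under macro-averaging, the reported F1 need not equal the harmonic mean
   of the reported precision and recall.

```python
def label_metrics(y_true, y_pred, k):
    """Precision, recall, and F1 for label column k."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t[k] == 1 and p[k] == 1)
    fp = sum(1 for t, p in pairs if t[k] == 0 and p[k] == 1)
    fn = sum(1 for t, p in pairs if t[k] == 1 and p[k] == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(y_true, y_pred):
    """Unweighted mean of the per-label metrics."""
    n_labels = len(y_true[0])
    per_label = [label_metrics(y_true, y_pred, k) for k in range(n_labels)]
    return tuple(sum(m[i] for m in per_label) / n_labels for i in range(3))


# Toy example with two labels per protein: columns are (AD, AR).
y_true = [[1, 0], [0, 1], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 1]]
precision, recall, f1 = macro_average(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```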

Functional effect models

   [63]Table 2 shows the performance of various models on the functional
   effect test set, with the GCN model achieving the highest $F_1$ score.
   We also evaluated the SVM models from Badonyi and Marsh[64]^8 as
   described in the [65]STAR Methods section. Based on the overall
   performance, we selected the GCN model for functional effect
   prediction.

Table 2.

   Functional effect prediction performance on the test set
   Metric     GCN    GAT    GIN    SVM[66]^8
   F1         0.627  0.590  0.600  0.593
   Precision  0.605  0.517  0.549  0.669
   Recall     0.659  0.712  0.676  0.535

   GCN, graph convolutional network; GAT, graph attention network; GIN,
   graph isomorphism network; SVM, support vector machine.

Model interpretation

MOI feature attribution

   Using the GAT model, we calculated feature attributions separately for
   correctly predicted AD or AR proteins in the test set. We observed that
   the most important predictors for AD prediction are features related to
   constraint and conservation ([68]Figure 3, left). The top feature was
   UNEECON (unified inference of variant effects and gene constraints),
   which measures evolutionary pressure.[69]^21 Using the labeled
   data, we observed that AD proteins have higher UNEECON values compared
   to AR proteins ([70]Figure 3, right).

Figure 3.

   GAT model interpretation for AD proteins

   Left panel shows the relative feature importance values for AD
   prediction. Right panel displays the distribution of the top feature
   (UNEECON score) for AD and AR proteins.

   For AR prediction, the most important feature was pLI (probability of
   loss-of-function intolerance)[73]^22 ([74]Figure 4, left). Using the
   ground truth dataset, we observed that AR proteins have lower pLI values
   compared to AD proteins ([75]Figure 4, right).

Figure 4.

   GAT model interpretation for AR proteins

   Left panel shows the relative feature importance values for AR
   prediction. Right panel displays the distribution of the top feature
   (pLI) for AD and AR proteins.

Functional effect feature attribution

   Using the GCN model, we measured feature attributions for correctly
   predicted DN, HI, and GOF proteins. Because features are at the residue
   level and predictions are at the protein level, we cannot draw direct
   conclusions from these measurements, yet they can help to understand
   the associations.

   For DN proteins, the most important feature was the RNA-binding score
   based on DRNApred[78]^23 ([79]Figure 5, left). Using the labeled data,
   we observed that residues in DN proteins have higher RNA-binding scores
   compared to HI and GOF proteins ([80]Figure S1).

Figure 5.

   GCN model interpretation

   Relative feature attribution values for DN (left), HI (middle), and GOF
   (right) predictions.

   For HI proteins, as shown in [83]Figure 5 (middle), the topological
   domain is the strongest predictor. This feature was derived from
   UniProt.[84]^24 We observed that HI proteins have a lower fraction of
   topological domains compared to DN and GOF proteins ([85]Figure S2).

   Feature attribution analysis for GOF proteins showed that the top
   feature is the helix structure ([86]Figure 5, right), derived from
   UniProt.[87]^24 The distribution of helical fractions indicates that
   GOF proteins have a relatively higher fraction of helical structures
   compared to HI and DN proteins ([88]Figure S3).

Proteome-wide inference

MOI prediction for all autosomal proteins

   Of the 17,248 nodes in the PPI network, 16,477 (96%) were autosomal,
   and we used the GAT model to predict the most likely MOI for all of
   them. A total of 8,869 (54%) were predicted to be AR, 6,277 (38%) were
   predicted to be AD, and 1,206 (7%) were predicted to be ADAR (autosomal
   dominant and autosomal recessive) ([89]Figure S4). As expected, we
   observed a strong negative correlation between the probability of being
   AD and AR (Pearson correlation coefficient = −0.95) ([90]Figure S5).
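
   Because the MOI model is multi-label, each protein receives one
   probability for AD and one for AR, and the category follows from which
   of the two cross a decision threshold. The sketch below illustrates
   this mapping and the resulting tally. The 0.5 threshold, the
   "unassigned" category for proteins that cross neither threshold, and
   the toy probabilities are assumptions and are not taken from the study.

```python
from collections import Counter


def moi_category(p_ad, p_ar, threshold=0.5):
    """Map the two output probabilities to AD, AR, ADAR, or unassigned."""
    ad, ar = p_ad >= threshold, p_ar >= threshold
    if ad and ar:
        return "ADAR"
    if ad:
        return "AD"
    if ar:
        return "AR"
    return "unassigned"


# Hypothetical (P(AD), P(AR)) pairs for four proteins.
predictions = {
    "P1": (0.91, 0.07),
    "P2": (0.12, 0.88),
    "P3": (0.64, 0.58),
    "P4": (0.31, 0.22),
}
tally = Counter(moi_category(p_ad, p_ar) for p_ad, p_ar in predictions.values())
print(tally["AD"], tally["AR"], tally["ADAR"], tally["unassigned"])  # 1 1 1 1
```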

   Finally, we performed pathway enrichment analyses for AD and AR
   proteins separately. AD proteins were significantly enriched in
   pathways associated with gene regulation ([91]Figure 6, left), while AR
   proteins were significantly overrepresented in mitochondrial pathways
   ([92]Figure 6, middle). Using the ground truth dataset, we observed
   that AR proteins are more likely to be localized inside mitochondria
   compared to AD proteins ($\mathrm{OR} = 3.13$, $\mathrm{CI} = [2.47, 3.97]$)
   ([93]Figure 6, right).

Figure 6.

   Pathway analysis for AD and AR proteins

   Left and middle panels show top significantly enriched pathways for AD
   and AR proteins, respectively. Right panel shows the number of proteins
   associated with subcellular localization inside or outside
   mitochondria. The odds ratio was calculated as
   $\left(\frac{\text{AR\_inside}}{\text{AR\_outside}}\right) /
   \left(\frac{\text{AD\_inside}}{\text{AD\_outside}}\right)$. The p value
   was calculated using Fisher’s exact test.
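
   As a worked illustration of the odds ratio defined in the caption, the
   sketch below computes it from a 2x2 table together with a confidence
   interval and a two-sided Fisher's exact test p value. The counts are
   invented for the example and are not the study's data. The Wald
   (log-scale) interval is an assumption, since the text does not state
   how the confidence interval was obtained.

```python
import math


def odds_ratio_ci(ar_in, ar_out, ad_in, ad_out, z=1.96):
    """Odds ratio with a Wald (log-scale) 95% confidence interval."""
    odds_ratio = (ar_in / ar_out) / (ad_in / ad_out)
    se = math.sqrt(1 / ar_in + 1 / ar_out + 1 / ad_in + 1 / ad_out)
    low = math.exp(math.log(odds_ratio) - z * se)
    high = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, (low, high)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))


# Hypothetical counts: AR inside/outside, AD inside/outside mitochondria.
or_value, (ci_low, ci_high) = odds_ratio_ci(30, 70, 10, 90)
p_value = fisher_exact_two_sided(30, 70, 10, 90)
```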

Functional effect prediction for all AD-predicted proteins

   Based on the proteome-wide MOI predictions, we identified 7,483 AD or
   ADAR proteins and predicted their functional effect using the GCN
   model. Among them, 2,043 (28%) were classified as only DN, 1,097 (15%)
   as only HI, and 415 (6%) as only GOF. Additionally, 1,843 (26%) were
   both DN and HI, 1,569 (22%) were both DN and GOF, 181 (3%) were both HI
   and GOF, and 35 (1%) were classified as DN, HI, and GOF ([96]Figure
   S6). We also provide the counts based on AD-only and ADAR-only proteins
   in [97]Figures S7 and [98]S8, respectively.

   Pathway enrichment analysis revealed that DN proteins are enriched in
   pathways associated with filament organization ([99]Figure 7, left), HI
   proteins are overrepresented in pathways related to transcription
   regulation ([100]Figure 7, middle), and GOF proteins are enriched in
   pathways related to ion transport across membranes ([101]Figure 7,
   right).

Figure 7.

   Pathway analysis for DN, HI, and GOF proteins

   The panels show the top significantly enriched pathways for proteins
   associated with DN (left), HI (middle), and GOF (right) mechanisms.

Discussion

   In this work, we introduce a novel framework that integrates GNNs with
   structural interactomics to predict both the MOI and the functional
   effects of mutated proteins in genetic disorders. By leveraging a PPI
   network and high-resolution protein structures, we offer a
   graph-of-graphs approach that addresses two critical aspects of genetic
   disease prediction. This allows us to not only classify proteins as AD
   or AR but also predict whether AD diseases manifest through HI, GOF, or
   DN mechanisms. Our framework demonstrated good performance in
   predicting MOI, with the GAT model achieving the best $F_1$ score
   (0.750) for identifying AD and AR proteins. In terms of functional
   effects, the GCN model effectively classified HI, GOF, and DN proteins
   based on structural features.

   The most important feature in predicting AD proteins is the
   evolutionary pressure score (UNEECON).[104]^21 This aligns with
   previous studies showing that AD proteins experience stronger negative
   selection than AR proteins.[105]^25^,[106]^26 Additionally, we observed
   a strong enrichment of AD proteins in pathways related to gene
   expression regulation. Prior research has shown that transcription
   factors (TFs) are often dosage sensitive, particularly
   haploinsufficient, leading to dominant disease
   phenotypes.[107]^27^,[108]^28 This is consistent with the fact that
   many human birth defects and neurodevelopmental disorders are caused by
   mutations in a single copy of TF and chromatin regulator
   genes.[109]^29

   For AR proteins, we found that the pLI (probability of loss-of-function
   intolerance) index is the most important feature. The pLI index
   estimates the likelihood that knocking out one copy of a gene will
   result in a phenotype.[110]^22 Low-pLI genes typically exhibit
   functional redundancy or possess sufficient reserve capacity, allowing
   heterozygous carriers to remain asymptomatic.[111]^30 We also observed
   that AR proteins are enriched in mitochondrial pathways, consistent
   with previous findings that the vast majority of nuclear-encoded
   mitochondrial disease genes follow a recessive inheritance
   pattern.[112]^31 This bias toward recessive inheritance likely arises
   because defects in energy metabolism generally become pathogenic only
   when both alleles are disrupted. As long as one allele remains
   functional, mitochondrial pathways can sustain baseline energy
   production, preventing deleterious consequences.[113]^32^,[114]^33

   Feature attribution analysis revealed that DN proteins are strongly
   associated with high RNA-binding scores, consistent with the previous
   observation that DN mutations are enriched in nucleic acid-binding
   pathways.[115]^8 One possible explanation is that DN mutations often
   occur at critical interaction sites, such as DNA/RNA-binding
   interfaces, where they allow the mutant protein to retain its ability
   to bind partners but disrupt the overall function of the interacting
   complex. Additionally, we observed an enrichment of DN proteins in
   pathways related to filament organization. This is likely due to the
   inherent susceptibility of filamentous and polymeric assemblies to
   “poisoning” by mutant subunits, which can incorporate into multimers
   and destabilize the entire structure. A well-documented example is
   keratin-related disorders, where keratins (type I and II intermediate
   filament proteins) form an essential cytoskeletal network in epithelial
   cells. Mutations in keratin genes lead to cell fragility and are
   inherited in an AD manner, with the mutations exerting their effect
   through a DN mechanism.[116]^34

   A depletion of topological domains emerged as the most important feature
   in predicting HI proteins. According to UniProt, the topological domain
   annotation defines the subcellular compartment in which each
   non-membrane region of a membrane-spanning protein is located. Our
   findings indicate that DN and GOF proteins have a higher fraction of
   topological domain annotations compared to HI proteins, suggesting that
   HI genes encode fewer membrane-spanning proteins than DN and GOF genes.
   This distinction may reflect fundamental differences in the functional
   roles of these proteins and their sensitivity to dosage effects.
   Furthermore, HI proteins are significantly enriched in pathways related
   to transcriptional regulation, consistent with previous findings that
   TFs are frequently dosage sensitive and particularly prone to
   HI.[117]^27^,[118]^28

   Finally, the most important feature for predicting GOF proteins is
   helix structure. This finding is consistent with previous reports
   showing that GOF variants are significantly more likely to occur in
   alpha helices.[119]^6 We also observed that GOF genes are enriched in
   pathways related to ion transport across membranes, further supporting
   the structural-functional link between helices and membrane proteins. A
   notable example is epilepsy-associated genes: approximately 25% of them
   encode ion channels, many of which cause epilepsy through a GOF
   mechanism.[120]^35 This association is biologically plausible, as many
   membrane proteins have a core architecture of transmembrane helices,
   which are critical for gating and transport
   functions.[121]^36^,[122]^37

   Moving forward, there are several avenues for expanding this work.
   Incorporating tissue-specific PPI networks and expression data could
   improve the precision of our predictions, especially for proteins with
   context-dependent functions.[123]^38 Additionally, expanding the model
   to account for more complex inheritance patterns, such as polygenic
   traits and epistasis, could provide a more comprehensive understanding
   of genetic disease.[124]^39 Moreover, improving the interpretability of
   models in biological contexts remains essential to derive more
   actionable insights from the predictions.[125]^40 Finally, integrating
   these predictions with other computational tools and databases could
   further enhance our understanding of genetic diseases by providing a
   more holistic view of their underlying
   mechanisms.[126]^41^,[127]^42^,[128]^43

Limitations of the study

   Although our approach provides a comprehensive view of inheritance
   patterns and functional effects, it has several limitations. First, the
   availability of high-quality structural data for all human proteins
   remains limited, potentially affecting prediction accuracy.[129]^44 To
   ensure uniform coverage, we relied on AlphaFold-predicted structures,
   which offer high accuracy for well-folded domains but have notable
   drawbacks. Unlike experimentally resolved structures, AlphaFold does
   not capture conformational flexibility, ligand interactions, or
   post-translational modifications, all of which are critical for
   functional interpretation. Additionally, it is less reliable for
   intrinsically disordered regions and dynamic protein states, where
   structural plasticity plays a key role.[130]^45 Beyond structural
   considerations, our reliance on existing PPI network data introduces
   potential biases, as interaction coverage varies across tissues and
   biological contexts.[131]^38 Furthermore, class imbalance in labeled
   training data may affect model performance, particularly for
   underrepresented functional categories. Finally, while our method
   effectively predicts the functional effects of AD proteins, it does not
   extend to other inheritance patterns or interactions influenced by
   polygenic or epistatic effects.[132]^46

Resource availability

Lead contact

   Requests for further information and resources should be directed to
   and will be fulfilled by the lead contact, Jacques Fellay
   (jacques.fellay@epfl.ch).

Materials availability

   This study did not generate new materials.

Data and code availability

     * Data: [133]https://zenodo.org/records/13969533.
     * Code: [134]https://github.com/AliSaadatV/Structural-Interactomics.
     * Other items: any additional information required to reanalyze the
       data reported in this article is available from the [135]lead
       contact upon request.

Acknowledgments