Highlights
* A graph-of-graphs approach integrates multi-scale protein
  information
* Graph neural networks predict both mode of inheritance and
  functional mechanisms
* Feature attribution reveals biological insights into genetic
  disease mechanisms
* Models were applied to all autosomal proteins, and predictions are
  publicly available
__________________________________________________________________
molecular interaction; molecular network; neural networks; quantitative
genetics
Introduction
Most human genetic diseases result from variants that disrupt protein
function through diverse molecular mechanisms, which play a critical
role in determining their mode of inheritance (MOI).[26]^1 In autosomal
dominant (AD) disorders, a single copy of a mutated gene can result in
disease, often through loss of function (LOF) due to haploinsufficiency
(HI), where the remaining wild-type allele cannot compensate for the
lost function.[27]^2 Dominant disorders can also result from non-LOF
mechanisms, such as gain of function (GOF), where the mutant protein
acquires a new or altered function, and the dominant-negative (DN)
effect, where the mutant protein interferes with the normal function of
the wild-type protein.[28]^3 In contrast, autosomal recessive (AR)
disorders require variants in both gene copies, predominantly involving
LOF mechanisms, such as missense variants that destabilize protein
structure or nonsense variants leading to truncated, non-functional
proteins.
Previous studies on MOI prediction have introduced computational tools
such as DOMINO,[29]^4 which utilizes linear discriminant analysis (LDA)
to predict whether a protein is associated with AD disorders by
integrating various features such as genomic data, conservation, and
protein interactions. MOI-Pred,[30]^5 on the other hand, focuses on
variant-level predictions, specifically targeting missense variants
associated with AR diseases.
More recent research has aimed at predicting the functional impact of
variants in specific genes. LoGoFunc combines gene-, protein-, and
variant-level features to predict pathogenic GOF, LOF, and neutral
variants.[31]^6 Another study explored the structural effects of
variants, finding that non-LOF variants tend to have milder impacts on
protein structure.[32]^7 Additionally, a recent study employed three
support vector machines (SVMs) to predict protein coding genes
associated with DN, GOF, and HI mechanisms.[33]^8
In this study, we present a comprehensive approach for predicting the
MOI for all proteins encoded by autosomal genes, as well as elucidating
the functional effect of variants underlying AD genetic disorders
([34]Figure 1). Our framework combines graph neural networks
(GNNs)[35]^9 with structural interactomics by creating a graph of
graphs,[36]^10 utilizing both a protein-protein interaction (PPI)
network and high-resolution protein structures. For MOI prediction, we model
proteins as nodes within the PPI network, incorporating topological and
protein-level features for classification. For molecular mechanism
prediction, we represent each protein as a graph of amino acid
residues, leveraging structure-based features to classify the
functional effect as HI, GOF, or DN. This integrated approach enables
proteome-wide prediction of inheritance patterns and provides
mechanistic insights into AD diseases, offering a novel, scalable
framework for understanding genetic disorders.
Figure 1.
Overview of the study
The mode of inheritance (MOI) is first predicted for all autosomal
proteins in the protein-protein interaction network. Next,
AlphaFold-predicted protein structures are used to generate
residue-level graphs for each dominant protein, which are then used to
predict functional effects. Figure created with
[39]https://www.biorender.com.
For the sake of flow and conciseness, we refer to “proteins associated
with AD disorders” as AD proteins and “proteins associated with AR
disorders” as AR proteins. Similarly, we use DN (GOF/LOF) proteins
instead of “proteins associated with DN (GOF/LOF) molecular disease
mechanisms.”
Results
Datasets
MOI data
We gathered 4,737 MOI-labeled proteins; among them, 2,494 (53%) were
only AR, 1,420 (30%) were only AD, and 808 (17%) were both AD and AR
([40]Figure 2, left).
Figure 2.
Summary of MOI data
Left panel shows the number of proteins with annotated MOI. Right panel
displays the number of proteins with annotations regarding their
molecular mechanism.
Functional effect data
We collected 1,276 proteins with annotated functional effect; among
them, 250 (20%) were only DN, 376 (29%) were only HI, 251 (20%) were
only GOF, 114 (9%) were both DN and HI, 115 (9%) were both DN and GOF,
92 (7%) were both HI and GOF, and 78 (6%) were DN, HI, and GOF
([43]Figure 2, right).
PPI construction and annotation
We constructed a comprehensive PPI network comprising 17,248 nodes and
375,494 edges by integrating interactions from STRINGdb (search tool
for the retrieval of interacting proteins database),[44]^11 BioGRID
(biological general repository for interaction datasets),[45]^12 the
Human Reference Interactome (HuRI),[46]^13 and Menche et al.[47]^14 To
characterize proteins, we annotated them with 78 selected features
covering structural, functional, evolutionary, and regulatory
properties ([48]Table S1).
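As a rough illustration of how interactions from several sources can be combined into one undirected network, the sketch below merges edge lists and deduplicates pairs regardless of orientation. The source names come from the text. The protein identifiers, the toy edges, the function name `merge_ppi_sources`, and the choice to drop self-interactions are all hypothetical. The identifier mapping and filtering actually used in the study are not described in this section.

```python
from collections import defaultdict


def merge_ppi_sources(sources):
    """Merge undirected edge lists from several sources into one network.

    Returns the node set, the deduplicated edge set, and a mapping from
    each edge to the set of sources that reported it.
    """
    provenance = defaultdict(set)
    for source_name, edge_list in sources.items():
        for protein_a, protein_b in edge_list:
            if protein_a == protein_b:
                continue  # assumption: self-interactions are dropped
            edge = tuple(sorted((protein_a, protein_b)))  # ignore orientation
            provenance[edge].add(source_name)
    nodes = {protein for edge in provenance for protein in edge}
    return nodes, set(provenance), dict(provenance)


# Toy edge lists with made-up protein identifiers.
toy_sources = {
    "STRINGdb": [("P1", "P2"), ("P2", "P3")],
    "BioGRID": [("P2", "P1"), ("P3", "P4")],
    "HuRI": [("P4", "P4"), ("P1", "P3")],
}
nodes, edges, provenance = merge_ppi_sources(toy_sources)
print(f"{len(nodes)} nodes, {len(edges)} edges")
```

Keeping the provenance of each edge makes it possible to filter later by the number of supporting sources, although the text does not say whether such a filter was applied.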
Protein graph construction and annotation
We obtained predicted protein structures from the AlphaFold
database[49]^15 and constructed residue-level graphs using
Graphein.[50]^16 In these graphs, nodes represent amino acids, while
edges capture peptide bonds, hydrogen bonds, disulfide bonds, ionic
interactions, and other structural contacts, including long-range
interactions. We annotated amino acid residues with 73 selected
features reflecting structural, sequence-based, biochemical, and
evolutionary characteristics ([51]Table S2).
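To make the residue-level representation concrete, here is a deliberately simplified sketch of a residue graph. It is not the Graphein pipeline used in the study: it keeps only two edge types, peptide bonds between consecutive residues and a generic distance contact between C-alpha atoms. The 8 Å cutoff, the function name `build_residue_graph`, and the toy coordinates are assumptions made for illustration.

```python
import math


def build_residue_graph(ca_coords, contact_cutoff=8.0):
    """Return a set of (i, j, kind) edges over residue indices.

    ca_coords: list of (x, y, z) C-alpha coordinates in sequence order.
    contact_cutoff: distance threshold in angstroms (assumed value).
    """
    edges = set()
    num_residues = len(ca_coords)
    # Consecutive residues are joined by a peptide bond.
    for i in range(num_residues - 1):
        edges.add((i, i + 1, "peptide_bond"))
    # Non-adjacent residues that are close in space form a contact edge.
    for i in range(num_residues):
        for j in range(i + 2, num_residues):
            if math.dist(ca_coords[i], ca_coords[j]) <= contact_cutoff:
                edges.add((i, j, "distance_contact"))
    return edges


# Toy chain: four residues on a line, 3.8 angstroms apart.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
edges = build_residue_graph(ca_coords)
for edge in sorted(edges):
    print(edge)
```

In the graph-of-graphs setting, one such residue graph would be attached to each protein node of the PPI network, with the residue features from Table S2 stored on the residue nodes.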
Model development
Study design
We formulated MOI prediction as a node classification task within the
PPI network and functional effect prediction as a graph classification
task. Both models employed a multi-label classification approach,
allowing each input to have multiple labels. We evaluated various GNN
architectures, including graph convolutional networks (GCNs),[52]^17
graph attention networks (GATs),[53]^18 and graph isomorphism networks
(GINs).[54]^19
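The following minimal sketch shows what a graph convolution with multi-label outputs looks like, using the standard GCN propagation rule, activation(D^{-1/2} (A + I) D^{-1/2} H W). It is a plain-Python toy, not the authors' implementation. The weights, the features, and the decision to treat the output layer as a second graph convolution are assumptions. Applying a sigmoid to each output unit independently is what allows a node to receive both labels at once.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Dense D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
           for i in range(num_nodes)]  # identity adds self-loops
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    degree = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(degree[i] * degree[j])
             for j in range(num_nodes)] for i in range(num_nodes)]


def matmul(left, right):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*right)]
            for row in left]


def gcn_layer(a_hat, features, weights, activation):
    """One propagation step: activation(A_hat @ H @ W)."""
    propagated = matmul(matmul(a_hat, features), weights)
    return [[activation(value) for value in row] for row in propagated]


def relu(value):
    return max(0.0, value)


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


# Toy PPI graph: three proteins in a chain, two features per protein.
a_hat = normalized_adjacency(3, [(0, 1), (1, 2)])
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_hidden = [[0.5, -0.2], [0.1, 0.3]]   # hypothetical weights
w_output = [[1.0, -1.0], [-1.0, 1.0]]  # two output units: AD and AR

hidden = gcn_layer(a_hat, features, w_hidden, relu)
probabilities = gcn_layer(a_hat, hidden, w_output, sigmoid)
for node, (p_ad, p_ar) in enumerate(probabilities):
    print(node, round(p_ad, 3), round(p_ar, 3))
```

For graph classification, the per-residue outputs would additionally be pooled into a single vector per protein. The pooling operator is not stated in this section.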
Data splitting
To construct training, validation, and test sets, we clustered human
protein sequences using MMseqs2[55]^20 with stringent thresholds of 20%
sequence identity and 20% alignment coverage. This conservative cutoff
minimizes sequence similarity between splits, thereby reducing the risk
of information leakage and encouraging the model to generalize. Protein
clusters were then assigned to the training (80%), validation (10%), or
test (10%) set.
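A cluster-level split of this kind can be sketched as follows. The cluster assignments would come from MMseqs2, but here they are a hypothetical dictionary. The function name, the random seed, and the choice to apply the 80/10/10 fractions to clusters rather than to individual proteins are assumptions. The key property is that all members of a cluster land in the same split.

```python
import random


def split_by_cluster(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/val/test so no cluster is divided."""
    clusters = sorted(set(cluster_of.values()))
    random.Random(seed).shuffle(clusters)
    n_train = round(fractions[0] * len(clusters))
    n_val = round(fractions[1] * len(clusters))
    split_of_cluster = {}
    for index, cluster in enumerate(clusters):
        if index < n_train:
            split_of_cluster[cluster] = "train"
        elif index < n_train + n_val:
            split_of_cluster[cluster] = "val"
        else:
            split_of_cluster[cluster] = "test"
    return {protein: split_of_cluster[c] for protein, c in cluster_of.items()}


# Toy input: 20 hypothetical proteins grouped into 10 clusters of two.
cluster_of = {f"protein_{i}": f"cluster_{i // 2}" for i in range(20)}
assignment = split_by_cluster(cluster_of)
for split in ("train", "val", "test"):
    print(split, sum(1 for s in assignment.values() if s == split))
```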
Hyperparameter tuning and model training
All models used a single hidden layer, with the output layer containing
two units for MOI prediction (AD and AR) and three units for functional
effect prediction (DN, HI, and GOF). To tune the hyperparameters, we
evaluated 25 configurations on the validation set by varying the hidden
layer size across five values (128, 64, 32, 16, and 8) and the learning
rate across five values ranging from $10^{-2}$ to $5 \times 10^{-4}$.
The results of hyperparameter tuning for MOI and functional effect
prediction are provided in [56]Tables S3 and [57]S4. Using the selected
hyperparameters, we trained each model with binary cross-entropy loss
for up to 100 epochs, applying early stopping based on validation loss
to prevent overfitting.
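The tuning and training procedure can be outlined in a few lines. The hidden sizes are taken from the text. Only the two endpoints of the learning-rate range are stated, so the three intermediate values below are assumed, as are the patience value and the function names. The loss is the mean binary cross-entropy over every (sample, label) pair, which is the usual form for multi-label outputs.

```python
import itertools
import math

HIDDEN_SIZES = (128, 64, 32, 16, 8)  # stated in the text
# Only the endpoints 1e-2 and 5e-4 are stated; the middle values are assumed.
LEARNING_RATES = (1e-2, 5e-3, 2e-3, 1e-3, 5e-4)
GRID = list(itertools.product(HIDDEN_SIZES, LEARNING_RATES))


def binary_cross_entropy(probabilities, targets, eps=1e-12):
    """Mean binary cross-entropy over every (sample, label) pair."""
    total, count = 0.0, 0
    for prob_row, target_row in zip(probabilities, targets):
        for prob, target in zip(prob_row, target_row):
            prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
            total -= target * math.log(prob) + (1 - target) * math.log(1.0 - prob)
            count += 1
    return total / count


def early_stopping_epoch(val_losses, patience=10, max_epochs=100):
    """Return (best_epoch, best_loss), stopping once `patience` epochs
    have passed without an improvement in validation loss."""
    best_epoch, best_loss = -1, math.inf
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss


print(len(GRID), "configurations")
print(binary_cross_entropy([[0.9, 0.2]], [[1, 0]]))
print(early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95], patience=2))
```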
Model performance evaluation
MOI models
We evaluated all trained models on the unseen test set ([58]Table 1).
The GCN model achieved the highest precision score, while the GAT model
had the best recall and F1 score. Due to the class imbalance in the MOI
dataset, we prioritized maximizing the F1 score and selected the GAT
model. We also assessed the performance of DOMINO[59]^4 as outlined in
the [60]STAR Methods section and found that the GCN and GAT models
outperformed it on all three metrics, whereas the GIN model had lower
F1 and recall.
Table 1.
MOI prediction performance on the test set
Metric      GCN     GAT     GIN     LDA[61]^4
F1          0.745   0.750   0.671   0.685
Precision   0.776   0.770   0.764   0.721
Recall      0.725   0.731   0.621   0.654
GCN, graph convolutional network; GAT, graph attention network; GIN,
graph isomorphism network; LDA, linear discriminant analysis.
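For readers who want to reproduce metrics of this kind, the sketch below computes precision, recall, and F1 for multi-label predictions. This section does not state the averaging scheme, so macro-averaging over labels is an assumption here, and the toy labels are invented. Under macro-averaging, F1 is averaged per label, which is one way a reported F1 can differ from the harmonic mean of the reported precision and recall.

```python
def multilabel_macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over label columns.

    y_true, y_pred: lists of equal-length 0/1 rows, one row per protein.
    """
    num_labels = len(y_true[0])
    precisions, recalls, f1_scores = [], [], []
    for k in range(num_labels):
        pairs = [(t[k], p[k]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    return (sum(precisions) / num_labels,
            sum(recalls) / num_labels,
            sum(f1_scores) / num_labels)


# Toy example with two labels (AD, AR) and four proteins.
y_true = [[1, 1], [0, 1], [0, 0], [0, 0]]
y_pred = [[1, 1], [1, 0], [0, 0], [0, 0]]
precision, recall, f1 = multilabel_macro_scores(y_true, y_pred)
print(precision, recall, f1)
```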
Functional effect models
[63]Table 2 shows the performance of various models on the functional
effect test set, with the GCN model achieving the highest F1 score. We
also evaluated the SVM models from Badonyi and Marsh[64]^8 as described
in the [65]STAR Methods section. Based on the overall performance, we
selected the GCN model for functional effect prediction.
Table 2.
Functional effect prediction performance on the test set
Metric      GCN     GAT     GIN     SVM[66]^8
F1          0.627   0.590   0.600   0.593
Precision   0.605   0.517   0.549   0.669
Recall      0.659   0.712   0.676   0.535
GCN, graph convolutional network; GAT, graph attention network; GIN,
graph isomorphism network; SVM, support vector machine.
Model interpretation
MOI feature attribution
Using the GAT model, we calculated feature attributions separately for
correctly predicted AD and AR proteins in the test set. We observed that
the most important predictors for AD prediction are features related to
constraint and conservation ([68]Figure 3, left). The top feature was
UNEECON (unified inference of variant effects and gene constraints),
which measures the evolutionary pressure.[69]^21 Using the labeled
data, we observed that AD proteins have higher UNEECON values compared
to AR proteins ([70]Figure 3, right).
Figure 3.
GAT model interpretation for AD proteins
Left panel shows the relative feature importance values for AD
prediction. Right panel displays the distribution of the top feature
(UNEECON score) for AD and AR proteins.
For AR prediction, the most important feature was pLI (probability of
loss-of-function intolerance)[73]^22 ([74]Figure 4, left). Using the
ground truth
dataset, we observed that AR proteins have lower pLI values compared to
AD proteins ([75]Figure 4, right).
Figure 4.
GAT model interpretation for AR proteins
Left panel shows the relative feature importance values for AR
prediction. Right panel displays the distribution of the top feature
(pLI) for AD and AR proteins.
Functional effect feature attribution
Using the GCN model, we measured feature attributions for correctly
predicted DN, HI, and GOF proteins. Because features are defined at the
residue level and predictions are made at the protein level, we cannot
draw direct conclusions from these measurements, yet they can help to
understand the associations.
For DN proteins, the most important feature was the RNA-binding score
based on DRNApred[78]^23 ([79]Figure 5, left). Using the labeled data,
we observed that residues in DN proteins have higher RNA-binding scores
compared to HI and GOF proteins ([80]Figure S1).
Figure 5.
GCN model interpretation
Relative feature attribution values for DN (left), HI (middle), and GOF
(right) predictions.
For HI proteins, as shown in [83]Figure 5 (middle), the topological
domain is the strongest predictor. This feature was derived from
UniProt.[84]^24 We observed that HI proteins have a lower fraction of
topological domains compared to DN and GOF proteins ([85]Figure S2).
Feature attribution analysis for GOF proteins showed that the top
feature is the helix structure ([86]Figure 5, right), derived from
UniProt.[87]^24 The distribution of helical fractions indicates that
GOF proteins have a relatively higher fraction of helical structures
compared to HI and DN proteins ([88]Figure S3).
Proteome-wide inference
MOI prediction for all autosomal proteins
Of the 17,248 nodes in the PPI network, 16,477 (96%) were autosomal,
and we used the GAT model to predict the most likely MOI for all of
them. A total of 8,869 (54%) were predicted to be AR, 6,277 (38%) were
predicted to be AD, and 1,206 (7%) were predicted to be ADAR (autosomal
dominant and autosomal recessive) ([89]Figure S4). As expected, we
observed a strong negative correlation between the probability of being
AD and AR (Pearson correlation coefficient = −0.95) ([90]Figure S5).
Finally, we performed pathway enrichment analyses for AD and AR
proteins separately. AD proteins were significantly enriched in
pathways associated with gene regulation ([91]Figure 6, left), while AR
proteins were significantly overrepresented in mitochondrial pathways
([92]Figure 6, middle). Using the ground truth dataset, we observed
that AR proteins are more likely to be localized inside mitochondria
compared to AD proteins (OR = 3.13, CI = [2.47, 3.97]) ([93]Figure 6,
right).
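The label assignment and the AD/AR probability correlation reported earlier in this subsection can be sketched as follows. How the "most likely MOI" was derived from the two output probabilities is not stated in this section, so the 0.5 threshold, the "unassigned" category, and the toy probabilities are assumptions.

```python
import math


def assign_moi(p_ad, p_ar, threshold=0.5):
    """Turn two independent probabilities into one MOI label."""
    is_ad = p_ad >= threshold
    is_ar = p_ar >= threshold
    if is_ad and is_ar:
        return "ADAR"
    if is_ad:
        return "AD"
    if is_ar:
        return "AR"
    return "unassigned"  # neither output passes the assumed threshold


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical model outputs for four proteins.
p_ad = [0.9, 0.2, 0.6, 0.1]
p_ar = [0.1, 0.8, 0.7, 0.3]
labels = [assign_moi(a, r) for a, r in zip(p_ad, p_ar)]
print(labels)
print(round(pearson(p_ad, p_ar), 3))
```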
Figure 6.
Pathway analysis for AD and AR proteins
Left and middle panels show top significantly enriched pathways for AD
and AR proteins, respectively. Right panel shows the number of proteins
associated with subcellular localization inside or outside
mitochondria. The odds ratio was calculated as
$(\mathrm{AR}_{\mathrm{inside}} / \mathrm{AR}_{\mathrm{outside}}) / (\mathrm{AD}_{\mathrm{inside}} / \mathrm{AD}_{\mathrm{outside}})$.
The p value was calculated using Fisher's exact test.
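The odds ratio defined in the caption and a two-sided Fisher's exact test can be computed from a 2 × 2 table as sketched below. The counts in the example are invented, since the underlying table is not given in the text. The confidence interval uses the log-scale normal approximation at an assumed 95% level. The source states neither the interval method nor the level, so that part is illustrative only.

```python
from math import comb, exp, log, sqrt


def odds_ratio_with_ci(ar_in, ar_out, ad_in, ad_out, z=1.96):
    """Odds ratio (AR_in / AR_out) / (AD_in / AD_out) with a
    log-scale normal-approximation interval (assumed method)."""
    odds_ratio = (ar_in / ar_out) / (ad_in / ad_out)
    std_err = sqrt(1 / ar_in + 1 / ar_out + 1 / ad_in + 1 / ad_out)
    lower = exp(log(odds_ratio) - z * std_err)
    upper = exp(log(odds_ratio) + z * std_err)
    return odds_ratio, lower, upper


def fisher_exact_two_sided(ar_in, ar_out, ad_in, ad_out):
    """Two-sided p value: sum the probabilities of all tables with the
    same margins that are no more likely than the observed table."""
    row_ar, row_ad = ar_in + ar_out, ad_in + ad_out
    col_in = ar_in + ad_in
    total = row_ar + row_ad

    def table_probability(x):
        # Hypergeometric probability of x AR proteins being "inside".
        return comb(row_ar, x) * comb(row_ad, col_in - x) / comb(total, col_in)

    observed = table_probability(ar_in)
    smallest = max(0, col_in - row_ad)
    largest = min(row_ar, col_in)
    probabilities = [table_probability(x) for x in range(smallest, largest + 1)]
    return sum(p for p in probabilities if p <= observed * (1 + 1e-9))


# Invented counts: AR inside/outside, then AD inside/outside mitochondria.
counts = (3, 1, 1, 3)
odds_ratio, ci_lower, ci_upper = odds_ratio_with_ci(*counts)
p_value = fisher_exact_two_sided(*counts)
print(odds_ratio, ci_lower, ci_upper, p_value)
```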
Functional effect prediction for all AD-predicted proteins
Based on the proteome-wide MOI predictions, we identified 7,483 AD or
ADAR proteins and predicted their functional effect using the GCN
model. Among them, 2,043 (28%) were classified as only DN, 1,097 (15%)
as only HI, and 415 (6%) as only GOF. Additionally, 1,843 (26%) were
both DN and HI, 1,569 (22%) were both DN and GOF, 181 (3%) were both HI
and GOF, and 35 (1%) were classified as DN, HI, and GOF ([96]Figure
S6). We also provide the counts based on AD-only and ADAR-only proteins
in [97]Figures S7 and [98]S8, respectively.
Pathway enrichment analysis revealed that DN proteins are enriched in
pathways associated with filament organization ([99]Figure 7, left), HI
proteins are overrepresented in pathways related to transcription
regulation ([100]Figure 7, middle), and GOF proteins are enriched in
pathways related to ion transport across membranes ([101]Figure 7,
right).
Figure 7.
Pathway analysis for DN, HI, and GOF proteins
The panels show the top significantly enriched pathways for proteins
associated with DN (left), HI (middle), and GOF (right) mechanisms.
Discussion
In this work, we introduce a novel framework that integrates GNNs with
structural interactomics to predict both the MOI and the functional
effects of mutated proteins in genetic disorders. By leveraging a PPI
network and high-resolution protein structures, we offer a
graph-of-graphs approach that addresses two critical aspects of genetic
disease prediction. This allows us to not only classify proteins as AD
or AR but also predict whether AD diseases manifest through HI, GOF, or
DN mechanisms. Our framework demonstrated good performance in
predicting MOI, with the GAT model achieving the best F1 score for
identifying AD and AR proteins. In terms of functional
effects, the GCN model effectively classified HI, GOF, and DN proteins
based on structural features.
The most important feature in predicting AD proteins is the
evolutionary pressure score (UNEECON).[104]^21 This aligns with
previous studies showing that AD proteins experience stronger negative
selection than AR proteins.[105]^25^,[106]^26 Additionally, we observed
a strong enrichment of AD proteins in pathways related to gene
expression regulation. Prior research has shown that transcription
factors (TFs) are often dosage sensitive, particularly
haploinsufficient, leading to dominant disease
phenotypes.[107]^27^,[108]^28 This is consistent with the fact that
many human birth defects and neurodevelopmental disorders are caused by
mutations in a single copy of TF and chromatin regulator
genes.[109]^29
For AR proteins, we found that the pLI (probability of loss-of-function
intolerance) index is the most important feature. The pLI index
estimates the likelihood that knocking out one copy of a gene will
result in a phenotype.[110]^22 Low-pLI genes typically exhibit
functional redundancy or possess sufficient reserve capacity, allowing
heterozygous carriers to remain asymptomatic.[111]^30 We also observed
that AR proteins are enriched in mitochondrial pathways, consistent
with previous findings that the vast majority of nuclear-encoded
mitochondrial disease genes follow a recessive inheritance
pattern.[112]^31 This bias toward recessive inheritance likely arises
because defects in energy metabolism generally become pathogenic only
when both alleles are disrupted. As long as one allele remains
functional, mitochondrial pathways can sustain baseline energy
production, preventing deleterious consequences.[113]^32^,[114]^33
Feature attribution analysis revealed that DN proteins are strongly
associated with high RNA-binding scores, consistent with the previous
observation that DN mutations are enriched in nucleic acid-binding
pathways.[115]^8 One possible explanation is that DN mutations often
occur at critical interaction sites, such as DNA/RNA-binding
interfaces, where they allow the mutant protein to retain its ability
to bind partners but disrupt the overall function of the interacting
complex. Additionally, we observed an enrichment of DN proteins in
pathways related to filament organization. This is likely due to the
inherent susceptibility of filamentous and polymeric assemblies to
“poisoning” by mutant subunits, which can incorporate into multimers
and destabilize the entire structure. A well-documented example is
keratin-related disorders, where keratins (type I and II intermediate
filament proteins) form an essential cytoskeletal network in epithelial
cells. Mutations in keratin genes lead to cell fragility and are
inherited in an AD manner, with the mutations exerting their effect
through a DN mechanism.[116]^34
Depletion of topological domain annotations emerged as the most
important feature in predicting HI proteins. According to UniProt, the
topological domain
annotation defines the subcellular compartment in which each
non-membrane region of a membrane-spanning protein is located. Our
findings indicate that DN and GOF proteins have a higher fraction of
topological domain annotations compared to HI proteins, suggesting that
HI genes encode fewer membrane-spanning proteins than DN and GOF genes.
This distinction may reflect fundamental differences in the functional
roles of these proteins and their sensitivity to dosage effects.
Furthermore, HI proteins are significantly enriched in pathways related
to transcriptional regulation, consistent with previous findings that
TFs are frequently dosage sensitive and particularly prone to
HI.[117]^27^,[118]^28
Finally, the most important feature for predicting GOF proteins is
helix structure. This finding is consistent with previous reports
showing that GOF variants are significantly more likely to occur in
alpha helices.[119]^6 We also observed that GOF genes are enriched in
pathways related to ion transport across membranes, further supporting
the structural-functional link between helices and membrane proteins. A
notable example is epilepsy-associated genes: approximately 25% of them
encode ion channels, many of which cause epilepsy through a GOF
mechanism.[120]^35 This association is biologically plausible, as many
membrane proteins have a core architecture of transmembrane helices,
which are critical for gating and transport
functions.[121]^36^,[122]^37
Moving forward, there are several avenues for expanding this work.
Incorporating tissue-specific PPI networks and expression data could
improve the precision of our predictions, especially for proteins with
context-dependent functions.[123]^38 Additionally, expanding the model
to account for more complex inheritance patterns, such as polygenic
traits and epistasis, could provide a more comprehensive understanding
of genetic disease.[124]^39 Moreover, improving the interpretability of
models in biological contexts remains essential to derive more
actionable insights from the predictions.[125]^40 Finally, integrating
these predictions with other computational tools and databases could
further enhance our understanding of genetic diseases by providing a
more holistic view of their underlying
mechanisms.[126]^41^,[127]^42^,[128]^43
Limitations of the study
Although our approach provides a comprehensive view of inheritance
patterns and functional effects, it has several limitations. First, the
availability of high-quality structural data for all human proteins
remains limited, potentially affecting prediction accuracy.[129]^44 To
ensure uniform coverage, we relied on AlphaFold-predicted structures,
which offer high accuracy for well-folded domains but have notable
drawbacks. Unlike experimentally resolved structures, AlphaFold does
not capture conformational flexibility, ligand interactions, or
post-translational modifications, all of which are critical for
functional interpretation. Additionally, it is less reliable for
intrinsically disordered regions and dynamic protein states, where
structural plasticity plays a key role.[130]^45 Beyond structural
considerations, our reliance on existing PPI network data introduces
potential biases, as interaction coverage varies across tissues and
biological contexts.[131]^38 Furthermore, class imbalance in labeled
training data may affect model performance, particularly for
underrepresented functional categories. Finally, while our method
effectively predicts the functional effects of AD proteins, it does not
extend to other inheritance patterns or interactions influenced by
polygenic or epistatic effects.[132]^46
Resource availability
Lead contact
Requests for further information and resources should be directed to
and will be fulfilled by the lead contact, Jacques Fellay
(jacques.fellay@epfl.ch).
Materials availability
This study did not generate new materials.
Data and code availability
* Data: [133]https://zenodo.org/records/13969533.
* Code: [134]https://github.com/AliSaadatV/Structural-Interactomics.
* Other items: any additional information required to reanalyze the
  data reported in this article is available from the [135]lead
  contact upon request.
Acknowledgments