Abstract
Gene set enrichment analysis (GSEA) is a powerful tool to associate a
disease phenotype to a group of genes/proteins. GSEA attributes a
specific weight to each gene/protein in the input list that depends on
a metric of choice, which is usually represented by quantitative
expression data. However, expression data are not always available.
Here, GSEA based on betweenness centrality of a protein–protein
interaction (PPI) network is described and applied to two cases, where
an expression metric is missing. First, personalized PPI networks were
generated from genes displaying alterations (assessed by array
comparative genomic hybridization and whole exome sequencing) in four
probands bearing a 16p13.11 microdeletion in common and several other
point variants. Patients showed disease phenotypes linked to
neurodevelopment. All networks were assembled around a cluster of first
interactors of altered genes with high betweenness centrality. All four
clusters included genes known to be involved in neurodevelopmental
disorders with different centrality. Moreover, the GSEA results pointed
out to the evidence of “cell cycle” among enriched pathways. Second, a
large interaction network obtained by merging proteomics studies on
three neurodegenerative disorders was analyzed from the topological
point of view. We observed that most central proteins are often linked
to Parkinson’s disease. The selection of these proteins improved the
specificity of GSEA, with “Metabolism of amino acids and derivatives”
and “Cellular response to stress or external stimuli” as top-ranked
enriched pathways. In conclusion, betweenness centrality revealed to be
a suitable metric for GSEA. Thus, centrality-based GSEA represents an
opportunity for precision medicine and network medicine.
Keywords: systems medicine, network medicine, gene set enrichment
analysis, topological analysis, neurodevelopment, neurodegeneration
Introduction
High-throughput data consist in a wide amount of information obtained
as the output of last-generation technologies. Nowadays, data from
omics approaches are obtained to systematically explore human biology
at a cellular or molecular level. This leads to a significant advantage
in the study of a biological system in its complexity ([37]Tebani et
al., 2016). On the other hand, an analytical method capable of reading,
prioritizing, and interpreting such large set of information is needed.
The integration of the medical/biological language and the
mathematical/computational language in a cross-disciplinary approach
represents a challenge ([38]Barabási, 2007). For this reason, new
theories and new algorithms have now been generated, and a strong
support of bioinformatics tools becomes necessary ([39]Al-Haggar et
al., 2013).
Gene set enrichment analysis (GSEA) is a powerful tool for the
interpretation of high-throughput expression studies such as mass
spectrometry-based proteomics or Next-Generation Sequencing, in order
to identify insights into biological processes or pathways underlying a
given phenotype. Thanks to the acquisition of an expression profile,
the list of differentially expressed genes is ranked in terms of a
metric associated to the observed expression change. The ranked gene
list is compared to a gene set, i.e., a list of genes known to be
associated with a certain biological process, gene ontology (GO),
molecular function or pathway. The metric is needed to calculate the
enrichment score (ES) that indicates the degree by which a gene set is
overrepresented at the extremes of the ranked list ([40]Subramanian et
al., 2005).
In the Big Data era, meaningful models and impactful results may only
be achieved at the network complexity level. In particular, genomics
and quantitative approaches to network-based analysis are combined to
advance the frontiers of network medicine ([41]Sonawane et al., 2019).
Indeed, the multifactorial approach of network medicine is based on the
multiplicity of factors that can alter a system to identify functional
connections that link the clinical phenotype to multiple factors
([42]Piñero et al., 2016).
Usually, protein networks are built starting from a list of query
proteins and by retrieving the protein–protein interaction (PPI)
information from a suitable database. The resulting network may be
further enriched by looking for first interactors that might join
isolated or distant nodes. The rationale of this approach is that
proteins that were identified as phenotype-correlated by chance are
likely excluded from the network, whereas proteins that for several
reasons were not detected as phenotype-correlated are now reconnected
to the network ([43]Fasano et al., 2016). A standard functional
analysis of the resulting network usually consists in an
over-representation analysis based on the Fisher’s exact test or a GSEA
([44]Monti et al., 2019).
The Graph Theory provides tools to analyze the connectivity structure
and the topology of a network. Node centrality and edge betweenness are
among the most useful parameters to identify hotspots in a network. The
importance of a node is the extent of its involvement in a network, and
this can be measured in multiple ways. Common measures include degree,
closeness, betweenness and eigenvector centralities ([45]Ashtiani et
al., 2018). The latter is at the core of the PageRank and the
hypertext-induced topics search (HITS) algorithms for ranking networks
([46]Kleinberg, 1999). The betweenness centrality of a given node takes
into account the number of geodesics, or shortest paths, between any
couple of nodes that include the considered node over the total number
of geodesics that connect any couple of nodes of the graph. When nodes
are in distinct components, the geodesic pathway is not defined.
Otherwise, the betweenness centrality ranges from 0 to 1. On the other
hand, the closeness centrality is inversely proportional to the sum of
the shortest paths between the given node and any other node in the
network.
Several algorithms have been proposed to combine pathway enrichment
analysis with network topological information and their performance has
been recently reviewed ([47]Ma et al., 2019). Some of them take into
account interconnectivity or adjacency information, whereas other
consider centrality measures such as degree or betweenness centrality.
Nevertheless, all topology-based pathway enrichment analysis methods
require an expression metric that indicate whether a gene is
differentially expressed in a group comparison.
Since whole genome sequencing (WGS) and whole exome sequencing (WES)
are useful genomics tools to identify genes showing alterations (e.g.,
single nucleotide variants – SNVs, Indels, larger duplications or
deletions), the results generated by these techniques can be used to
generate personalized networks for each individual ([48]Suwinski et
al., 2019). More generally, altered genes can be considered as
constituting the input list for generating a PPI network that reflects
a disease network model in a single individual, thus allowing for the
comparison between personalized networks of patients with the same
phenotype and a distinct genetic background.
Here, we propose a GSEA approach to analyze PPI networks based on
betweenness centrality values as ranking metrics. The proposed analysis
was applied to two case paradigms. In the first case, the network was
obtained from genomics data, where quantitative expression metrics are
not available. In the second case, the network represents the result of
a secondary data analysis, where proteomics data are scraped from the
literature and quantitative information is not always available.
Materials and Methods
Definition of Gene Signatures From Genomics Analysis
The patients and their parents came from the cohort of medical genetics
clinics of the ASST Sette Laghi showing clinical phenotypes related to
neurodevelopmental disorders. A written informed consent to perform
array-based comparative genomic hybridization (aCGH) and WES, and a
consent for use of anonymous data for research were provided by the
parents and relatives of the patients. The consent model and procedure
were approved by the Institutional Review Board (ASST Sette Laghi Cod
MOD09 IOS01SSDGM). Neurocognitive tests were based on WISC-IV (Wechsler
Intelligence Scale for Children). Language skills were assessed with
Boston and TROG-2 tests. No bias selection of gender, family history or
age was performed. To identify submicroscopic chromosomal
rearrangements, the aCGH technology was performed after DNA extraction
from peripheral blood cells (QIAmp DNA blood Maxi Kit, Qiagen, Hilden,
Germany). The four patients were selected on the basis of the presence
of 16p13.11 microdeletion.
For exome results, genes involved in neurodevelopmental disorders and
genes related to rare diseases were considered according to literature.
Among these, genes with variants in hemizygosity, homozygosity,
compound heterozygosity and simple heterozygosity inherited from the
healthy non-carrier parent of the 16p13.11 deletion were selected.
Array-based comparative genomic hybridization was performed using
CytoSure ISCA V2 4x180K platform with a backbone resolution of 1
probe/25 Kb and 1 probe/19 Kb in critical regions, human genome
reference hg19/GRCh37 and sex matched normal human DNA pool (Kreatech,
Amsterdam, Holland) as control. InnoScan 710 Microarray Scanner
(Carbonne, France) and Mapix (Innopsys, Carbonne, France) were used to
detect and analyze fluorescence levels. Results were interpreted using
Cytosure Interpret Software (Oxford Gene Technology, Begbroke,
Oxfordshire, United Kingdom).
Whole exome sequencing was performed on probands and parents DNA using
the Twist Human Core Exome Kit (Twist Bioscience, San Francisco, CA,
United States), according to the manufacturer’s protocol and sequenced
on the Illumina NovaSeq 6000 platform. The BaseSpace pipeline
(Illumina, San Diego, CA, United States) and the TGex software (LifeMap
Sciences, Alameda, CA, United States) were used for the variant calling
and annotation, respectively. Sequencing data were aligned to the
hg19/GRCh37 human reference genome. Variants with a lower coverage than
10×, quality score (GQ) <15, and gnomAD minor allele frequency (MAF)
>5% have been excluded.
Protein–Protein Interaction Networks From Genomics Analysis
Cytoscape 3.7.2^[49]1 was used to generate networks ([50]Su et al.,
2014). The public database IMEx^[51]2 was queried through Cytoscape
using the PSICQUIC standard (the Proteomics Standard Initiative Common
QUery InterfaCe). Each list of altered genes, obtained by aCGH and WES
as described above, was used to generate a network encompassing all
first interactors of the gene products. The network was filtered for
taxonomy ID 9606 (Homo sapiens) to remove homology inferences. A first
interactors list was generated and used to query the IMEx database
again, thus including second interactors too. All self loops and
duplicated edges were removed. Node degrees were calculated using the
NetworkAnalyzer plugin. Isolated (degree = 0) and terminal (degree = 1)
nodes were removed to reduce the network size, with the only exception
for isolated/terminal nodes included in the input list of altered
genes.
Protein–Protein Interaction Network From Secondary Data Analysis
Proteins reported to be associated to Parkinson’s disease (PD),
amyotrophic lateral sclerosis (ALS) and Alzheimer’s disease (AD) were
extracted from a secondary data analysis by [52]Monti et al. (2018).
Using a pubmed scraping procedure, authors obtained lists of proteins
reported to be quantitatively altered in the three neurodegenerative
disorders considered and built a network that encompasses all selected
proteins and their common interactors using PPI spider ([53]Antonov et
al., 2009). The network was then exported as a.xgmml file.
Topological Analysis in Cytoscape
Betweenness centrality was calculated for each node in networks by
using the NetworkAnalyzer utility in the Cytoscape Environment. Both
size and border width of all nodes were mapped onto their betweenness
centrality values. The betweenness centrality of node n[i] is given by
[MATH:
CB(
ni)=∑j<Kσjk<
/msub>(ni)<
/mo>σjk<
/msub>1(N2) :MATH]
where N is the total number of nodes, σ[jk] is the number of geodesics
between nodes n[j] and n[k], and σ[jk](n[i]) is the number of geodesic
pathways containing node n[i]. Networks were reduced in size by
selecting nodes having betweenness centrality ≥10^–4 for display
clarity only. For each network, betweenness centralities were
transformed with a Hill function in order to enhance the weight of
nodes with higher values, as follows hereafter:
[MATH: CB′=CBhk+CBh :MATH]
.
Gene Set Enrichment Analysis Using R
Gene set enrichment analysis was performed using the ReactomePA package
([54]Yu and He, 2016) in the R environment for statistical analysis
([55]R Core Team, 2017). Normalized enrichment score (NES), p-value,
and false discovery rate (FDR) for all variables and signatures were
obtained in the R environment. The database for biological
processes/pathways Reactome was used for the analysis ([56]Fabregat et
al., 2017). Hill-transformed betweenness centralities were used as a
suitable ranked list metric instead of those usually considered in
expression studies. The list was then resampled by randomly assigning
observed values to gene IDs and the analysis was performed again to
identify pathways in common with those arising from the analysis of
values that were obtained from the topological analysis of the
networks. The R code is reported in the [57]Supplementary Material.
Over-Representation Analysis
Over-representation analysis was performed for comparison using the
WEB-based GEne SeT AnaLysis Toolkit^[58]3 ([59]Liao et al., 2019) using
Reactome as the functional database and “human genome, protein coding”
as the reference set.
Results
Personalized Signatures From aCGH and WES Analysis
Four patients were referred to genetic investigations for diagnostic
purposes and counseling for developmental disorders ranging from
learning delay to intellectual disability, with or without associated
congenital malformations. By aCGH screening, a microdeletion on
16p13.11 was identified. The deletion on 16p13.11 was either inherited
from the healthy mother (two patients) or from the healthy father (one
patient), or it was a de novo unbalance (one patient). Several
mutations inherited from both healthy parents were found in the four
patients by WES analysis. By merging aCGH and WES results, lists of
altered genes were generated and used as personalized signatures for
each patient ([60]Table 1).
TABLE 1.
Gene signatures for the four patients.^a
Patient 1 Patient 2 Patient 3 Patient 4
ABCC1 ABCC1 ABCC1 ABCC1
ABCC6 ABCC6 ABCC6 ABCC6
ANK3 BCKDK BCKDK FOPNL
DYRK1A CHRNA7 FOPNL GPRASP1
FOPNL FOPNL IQSEC2 MARF1
GRIN2A GRM5 MARF1 MYH11
KIF1A MARF1 MYH11 NDE1
MARF1 MECP2 NDE1 NLGN4X
MYH11 MYH11 NOMO3 NOMO1
NDE1 NDE1 RGPD4 NOMO3
NOMO3 NOMO3 RRN3 NPIPA1
NPIPA1 RRN3 TG NRXN1
NTAN1 SHANK1 TSC2 NTAN1
PDXDC1 XYLT1 XYLT1 PDXDC1
RRN3 PHEX
RRN3
[61]Open in a new tab
^aLists include altered genes from both aCGH and WES results.
The datasets presented in this study can be found in online
repositories. The names of the repository/repositories and accession
number(s) are listed hereafter. For aCGH data: DECIPHER^[62]4 350680
(Patient 1), 414066 (Patient 2), 414078 (Patient 3) and 318359 (Patient
4). For WES data: European Variation Archive Project: [63]PRJEB41629,
Analyses: [64]ERZ1687005.
Genomics Networks Without Expression Metrics
The four lists of altered genes were used to generate personalized
networks. [65]Supplementary Figure 1 shows an overview of the four
networks encompassing first and second interactors after filtering out
isolated and terminal nodes (i.e., nodes with degree = 0 and degree =
1, respectively) and nodes with betweenness centrality ≤10^–4, except
for genes in the signature lists. Signature genes display different
betweenness centrality values in the four personalized models with
three orders of magnitude among different genes ([66]Table 2). None of
the interaction networks has nodes with high values. Genes MYH11 and
NDE1 are present in all four networks with relatively high
centralities, whereas ANK3, CHRNA7, DYRK1A, KIF1A, SHANK1, and TSC2,
although they are present in single networks, also have high
centrality. On the other hand, several signature genes have negligible
centrality, although they are present in all four networks (e.g.,
ABCC1, ABCC6, FOPNL, MARF1, NOMO3, and RRN3).
TABLE 2.
Betweenness centrality values (−log[10]) of all signature genes.
Patient 1 Patient 2 Patient 3 Patient 4
ABCC1 3.98 3.87 3.70 3.86
ABCC6 5.02 4.80 4.71 4.74
ANK3 3.00 – – –
BCKDK – 3.11 2.90 –
CHRNA7 – 3.18 – –
DYRK1A 2.97 – – –
FOPNL 4.63 4.69 4.61 4.60
GPRASP1 – – – 2.97
GRIN2A 5.07 – – –
GRM5 – 5.13 – –
IQSEC2 – – 4.79 –
KIF1A 2.52 – – –
MARF1 3.87 3.67 3.51 3.58
MECP2 – 2.50 – –
MYH11 2.78 2.91 2.74 2.77
NDE1 2.53 2.49 2.32 2.38
NLGN4X – – – 5.49
NOMO1 – – – 2.68
NOMO3 4.46 3.99 3.85 4.03
NPIPA1 0 – – 0
NRXN1 – – – 5.55
NTAN1 5.27 – – 4.82
PDXDC1 3.03 – – 2.94
PHEX – – – 0
RGPD4 – – 5.49 –
RRN3 4.78 4.66 4.51 4.75
SHANK1 – 3.45 – –
TG – – 0 –
TSC2 – – 2.53 –
XYLT1 – 4.19 4.07 –
[67]Open in a new tab
[68]Figure 1 shows the 10 most central nodes in each personalized
network together with interacting signature genes. In all cases, top 10
nodes are among the first interactors of signature genes, with six out
of 10 genes in common among patients (i.e., DISC1, ESR1, GOLGA2, GRB2,
JUN, and HSCB). No signature genes displayed centrality values
comparable to first interactors represented in [69]Figure 1. An
over-representation analysis of these clusters, including top 10 genes
and interacting signature genes, did not lead to any significantly
over-represented pathway.
FIGURE 1.
[70]FIGURE 1
[71]Open in a new tab
Top 10 nodes in terms of betweenness centrality and interacting
signature nodes. Red nodes are genes from the individual gene signature
and navy-blue nodes are their first interactors. Diameter and border
width of nodes are proportional to betweenness centrality, whereas edge
width is proportional to edge betweenness.
In order to enrich the functional analysis with nodes with higher
centrality, original betweenness centrality values were transformed
with a Hill function. [72]Figure 2 shows ranked centrality values
before and after the transformation. In this way, nodes with low
centrality have a negligible contribution to the ES, whereas nodes with
high centrality represent leading edges. As an example, patient 1
network had 8,101 nodes. Among them, 579 nodes had a score (i.e.,
betweenness centrality after Hill transformation) greater than 0.2, 318
greater than 0.5, and 192 greater than 0.8. Therefore, this
transformation dramatically reduced the dimensionality of the dataset
by selecting less than 10% of the original network genes. As a
comparison, we performed the same transformation using closeness
centrality. As shown in [73]Supplementary Figure 2, closeness
centrality values are distributed in a narrow range even after Hill
transformation.
FIGURE 2.
[74]FIGURE 2
[75]Open in a new tab
Centrality metrics before and after Hill transformation. The solid line
shows betweenness centrality obtained from the topological analysis of
a representative personalized network. The dashed line represents
centrality values after transformation with a Hill function.
Gene set enrichment analysis was performed for the four networks using
transformed betweenness centrality values as the scoring metric, with
Reactome as gene set database. [76]Figure 3 shows enriched categories
as a dot plot, with dot size proportional to the number of overlapping
genes. All patients displayed the enrichment of one or more pathways
related to cell cycle and to Rho GTPases in a different ranking in
terms of statistically significant gene ratio, i.e., the ratio between
the number of genes related to each pathway and the total number of
genes in the network. These pathways were not identified after
resampling of scores obtained from betweenness centrality values
([77]Supplementary Figure 3). Moreover, classic over-representation
analysis of the whole network without ranking measures led to a
different set of enriched pathways. [78]Table 3 summarizes enriched
pathways found by classic over-representation analysis of patient 1
network as a whole and by selecting subsets of nodes by their ranking
(8,101, 579, 318, and 192 genes, respectively) (see [79]Supplementary
Materials).
FIGURE 3.
[80]FIGURE 3
[81]Open in a new tab
Dot plots for the four GSEA results. Numbers refer to patients as
indicated in [82]Table 1. Dot size is proportional to the number of
overlapping genes. p-values are color-coded according to the color
scale.
TABLE 3.
Comparison of enriched pathways.
Pathway ORA All^a (8101) ORA 0.2 (579) ORA 0.5 (318) ORA 0.8 (192) GSEA
GSEA random
Translation X
RNA metabolism X X
Deubiquitination X X
Developmental biol. X
Organelle biogenesis X X
Vesicle transport X X X
Transcription X X X X
Disease X X X X
Immune system X X X X X
Cell cycle X X X X X
Apoptosis/Cell death X X X X
Cellular resp. to stress/ext. stimuli X X X X
Rho GTPases effectors/signaling X X X X
Post-transl. modification X X
Cilium assembly X X
Intracellular signaling X X
Hemostasis X
Biological oxidations X
[83]Open in a new tab
^aORA All, over-representation analysis performed on all nodes in the
network; ORA 0.2, ORA performed on 579 nodes with ranking metric >0.2;
ORA 0.5, ORA performed on 318 nodes with ranking metric >0.5; ORA 0.8,
ORA performed on 192 nodes with ranking metric >0.8; GSEA,
topology-based GSEA using all ranked genes; GSEA random, topology-based
GSEA using all ranked genes after resampling.
Consensus Networks From Secondary Data Analysis
The centrality-based GSEA approach was then applied on a consensus
network already available in the literature. In a previous paper by
[84]Monti et al. (2018), authors retrieved full-text proteomics
original articles focused on the development of three neurodegenerative
diseases and extracted three input lists including 928 proteins for AD,
1,155 proteins for PD, and 387 proteins for ALS. The input lists were
used to generate a consensus network representing interactions among
proteins in the lists ([85]Monti et al., 2018). We performed the
topological analysis of the consensus network, whose results are shown
in [86]Figure 4A, where node size and border width are proportional to
betweenness centrality. This network encompasses 483 nodes and 659
edges. Noteworthy, among the 95 nodes with the highest betweenness
centrality (i.e., the top 20% central nodes), 78 are linked to PD,
either in common or not with other diseases ([87]Figure 4B). Most
central genes are TIMP2 and EIF3A, both associated to PD only, and
GABARAPL2, associated to PD and AD. However, these three genes did not
lead to any significant association with PD pathogenesis, when taken
alone. All nodes related to PD were then extracted from the whole
network for further topological analysis and GSEA ([88]Figure 4C).
FIGURE 4.
[89]FIGURE 4
[90]Open in a new tab
Topological analysis of the proteomics neurodegeneration network from
secondary data analysis. (A) Whole network; (B) top 20% most central
nodes; (C) selection from the whole network of nodes altered in
Parkinson’s disease (PD) patients. Nodes are color-coded as follows.
Pink, PD; dark green, amyotrophic lateral sclerosis (ALS); orange,
Alzheimer’s disease (AD); yellow, PD and ALS; light green, ALS and AD;
red, PD and AD; black, PD, ALS and AD (adapted from [91]Monti et al.,
2018). Node size and border width are proportional to betweenness
centrality, whereas edge width is proportional to edge betweenness.
Gene set enrichment analysis was performed for the whole network
([92]Figure 4A) and for the PD network ([93]Figure 4C) using
Hill-transformed betweenness centrality values as the scoring metrics,
with Reactome as gene set reference database. [94]Figure 5 shows
enriched categories as a dot plot, with dot size proportional to the
number of overlapping genes. For the whole network, enriched pathways
mainly refer to catabolic pathways. Immune system appeared to be
significantly enriched; however, it was also found after resampling of
centrality metrics ([95]Supplementary Figure 4). On the other hand,
most significantly enriched pathways for the PD network were
“Metabolism of amino acids and derivatives” and “Cellular response to
stress or external stimuli,” that were not observed in the analysis of
the resampled dataset.
FIGURE 5.
[96]FIGURE 5
[97]Open in a new tab
Dot plots for the GSEA results. (A) Whole network; (B) selection of
nodes altered in Parkinson’s disease patients. Dot size is proportional
to the number of overlapping genes. P-values are color-coded according
to the color scale.
Discussion
Gene set enrichment analysis is usually performed to associate a
disease phenotype to a group of genes by using quantitative expression
data. Unlike over-representation analysis, GSEA has the advantage of
considering the role of each gene by taking into account its
correlation to the clinical phenotype or any other categorical variable
(treatment, stage, time, and other experimental conditions)
([98]Subramanian et al., 2005). Nevertheless, a quantitative metric is
not always available in a dataset. Here, GSEA based on betweenness
centrality of a PPI network is described and applied to two cases,
where an expression metric is missing.
Actually, building a network starting from a list of entities such as
genes or proteins is based on publicly available evidence of
interaction, and the topology of the network accounts for the specific
role each protein plays in the network itself. The degree of each node
(i.e., the number of interactions with other nodes) is clearly a
measure of how much this node is integrated in the network. We decided
to consider the betweenness centrality as a suitable metric to rank
network nodes in terms of the role they play in the disease represented
by the network. Among other centrality measures, betweenness centrality
discriminates centrality of nodes over a range that spans orders of
magnitudes, unlike closeness centrality values that are distributed in
a narrow range and therefore are slightly affected by the resampling
procedure. Signature genes of the first case analyzed here showed very
different centrality values in a range from 10^–6 to 10^–3. Worthy of
note, several genes showing high centrality values are involved in
neurodevelopmental disorders. Indeed, some of them (ANK3, BCKDK,
CHRNA7, DYRK1A, MECP2, SHANK1, and TSC2) are included in the database
SFARI ([99]Banerjee-Basu and Packer, 2010) with high scores, supporting
that the gene contributes to the disorder pathogenesis or it is a
strong candidate.
The personalized networks are assembled around a cluster of very
central genes, whose composition is partially overlapping.
Interestingly, the top 10 clusters never include signature genes, and
the weight of every member of the clusters may change among patients.
As an example, the leucine-rich repeat kinase 2 (LRRK2) is present in
all four networks. However, its centrality is low in patient 1 (C[B] =
7.56 × 10^–3), whereas it has the top value in the other three
patients. Recently, a role for LRRK2 in cognitive development has been
proposed, thus linking LRRK2 to intellectual disability and autism
([100]Labonne et al., 2020). Another gene present with very high
centrality in all four networks, ESR1, has been related to the
pathogenesis of autism and autism related symptoms ([101]Wang et al.,
2016).
We used the tools included in the ReactomePA package ([102]Yu and He,
2016) to associate a topological parameter to each gene in the dataset
in place of commonly used expression data. However, we observed that
betweenness centrality values show a steep decrease within the first
nodes in the ranked list. To select a sub-population of central nodes,
betweenness centrality values were transformed with a Hill function
that assigns high scores to less than 10% nodes. In this way, a drastic
change in the weight associated to each gene is expected. Using
Hill-transformed betweenness centrality values, GSEA of the four
personalized networks yielded a different ranking of enriched pathways
or even different pathways. As an example, the pathway “Cell Cycle,
Mitotic” is enriched in all four patient networks. However, its ranking
differs in the four dot plots. From the pathogenetic point of view, all
patients share a neurodevelopmental impairment that is brought to
evidence by the present analysis (e.g., “Mitotic G1-G1/S phase”
pathway). It is known from literature that neurodevelopmental disorders
are related to alterations of the cell cycle, and a central role is
played by genes involved in neurogenesis regulation and cell adhesion
processes ([103]Pramparo et al., 2015). Moreover, our evidence supports
the conclusions obtained so far on NDE1, the candidate gene for the
susceptibility to the neurocognitive disorder given by the deletion on
16p13.11. Using a bottom-up approach, NDE1 and NDEL1 (its paralogue)
were deeply characterized (i.e., biochemistry, PPIs, role in neuronal
differentiation and cortical development, pathology caused by their
alterations) ([104]Bradshaw and Hayashi, 2017). In particular, the role
of NDE1 and NDEL1 in cell cycle progression, neurite outgrowth and
development and neuronal migration has been highlighted, thus
supporting our results.
As a second application of the proposed approach, we analyzed a PPI
network previously published by our group ([105]Monti et al., 2018).
This network encompasses proteins that were reported to quantitatively
change in three neurodegenerative diseases by independent articles. Two
of them are movement disorders (PD and ALS) and one is a cognitive
disorder (AD). The original aim of the work was to distinguish between
proteins that are specifically associated to a single neurodegenerative
disease and proteins generically linked to neurodegeneration.
Topological analysis of the network adds substance to this purpose. We
observed that most central proteins are often linked to PD, either
standalone or associated to other disorders. Indeed, the most evident
colors in [106]Figure 4B are pink (PD), red (PD and AD) and black (PD,
AD and ALS). To gain more evidence to pathways associated to most
central nodes, we extracted a subnetwork to represent all proteins that
were altered in PD. This selection, together with the topological
analysis, allowed us to improve the specificity of GSEA.
The use of a topological parameter to perform GSEA in the absence of
any quantitative expression metric was never done before. This
application of GSEA has been performed on two datasets, where
expression metrics were not available. Although we decided to consider
betweenness centrality as a ranking metric, further analyses could be
performed by identifying other candidate topological parameters and
using these data to obtain new pathway networks and new information.
Conclusion
The centrality-based GSEA procedure is another powerful tool to
investigate personalized networks or disease networks, in order to
unveil hidden information and to specifically plan experimental
investigations. This analysis can be applied to a wide variety of
networks, also in combination with clustering tools. Since biological
networks are usually too large and are characterized by high
modularity, topological descriptors could assist the identification of
functional modules ([107]Sharma et al., 2017; [108]Choobdar et al.,
2019). Altogether, the functional analysis of networks by weighting
nodes in terms of their centrality could also provide a valuable tool
to explore pathogenetic mechanisms and to precisely identify sensitive
targets for drug development or repositioning. In this view,
centrality-based GSEA represents an opportunity for precision medicine
and network medicine.
Data Availability Statement
The datasets presented in this study can be found in online
repositories. The names of the repository/repositories and accession
number(s) can be found below: aCGH data: DECIPHER
[109]https://decipher.sanger.ac.uk/350680 (Patient 1), 414066 (Patient
2), 414078 (Patient 3) and 318359 (Patient 4). WES data: European
Variation Archive Project: [110]PRJEB41629, Analyses: [111]ERZ1687005.
Ethics Statement
The studies involving human participants were reviewed and approved by
ASST Sette Laghi ethical committee. Written informed consent to
participate in this study was provided by the participants’ legal
guardian/next of kin.
Author Contributions
AZ performed the cytogenetic analysis, bioinformatics analysis and
wrote the manuscript. ML and TA revised the GSEA results and the
manuscript. PG performed the cytogenetic analysis and contributed to
the interpretation of whole exome sequencing results. DC and AN
performed the whole exome sequencing analysis. RC enrolled the
patients, carried out medical genetics visits, supervised genetics
analysis, and revised the manuscript. MF supervised the project, wrote
the code, and wrote the manuscript. All authors approved the final
version of the manuscript.
Conflict of Interest
The authors declare that the research was conducted in the absence of
any commercial or financial relationships that could be construed as a
potential conflict of interest.
Acknowledgments