Graphical abstract
graphic file with name fx1.jpg
[37]Open in a new tab
Highlights
* •
Hypergraph modeling enhances scRNA-seq representation by reducing
sparsity
* •
Capturing cellular heterogeneity and gene modules uncovers complex
GRN relationships
* •
HyperG-VAE surpasses benchmarks in predicting GRNs and identifying
key regulators
* •
Tested on B cells, HyperG-VAE excels in gene regulation,
clustering, and lineage tracing
Motivation
Gene regulatory networks (GRNs) derived from single-cell RNA sequencing
(scRNA-seq) data provide insights into the complex interactions between
transcription factors (TFs) and target genes. This capability enables a
detailed understanding of gene expression regulation and cellular
function across diverse populations. However, addressing both cellular
heterogeneity and gene modules remains a significant challenge, as
current GRN inference methods struggle to bridge this gap. To address
this limitation, we introduce hypergraph variational autoencoder
(HyperG-VAE), an algorithm based on hypergraph representation learning
that captures latent correlations among genes and cells, enhancing the
imputation of contact maps.
__________________________________________________________________
Su et al. present HyperG-VAE, a hypergraph-based model integrating
cellular heterogeneity and gene modules for robust GRN inference from
scRNA-seq data. By capturing latent gene-cell correlations, it
overcomes sparsity, outperforms existing methods, and reveals key gene
regulation patterns with robust downstream analysis.
Introduction
Gene regulatory networks (GRNs) within single-cell RNA sequencing
(scRNA-seq) datasets present a sophisticated interplay of transcription
factors (TFs) and target genes, uniquely capturing the modulation of
gene expression and thereby delineating the intricate cellular
functions and responses within diverse cell populations.[38]^1 GRNs
illuminate core biological processes and underpin applications from
disease modeling to therapeutic design,[39]^2^,[40]^3^,[41]^4
empowering researchers to interpret the mechanisms of gene interactions
within cells and leverage this understanding for medical and
biotechnological innovations.[42]^5^,[43]^6
Numerous methodologies have emerged for inferring GRNs from single-cell
transcriptomic data. The algorithms emphasize co-expression networks in
a statistical way (e.g., PPCOR[44]^7 and LocaTE[45]^8) or aim to
decipher causal relationships between TFs and their target genes based
on the analysis of the gene interactions among cells (e.g.,
DeepSEM[46]^9 and PIDC[47]^10). Despite their achievements, these
algorithms still have inherent limitations. Specifically, these
approaches mainly focus on cellular heterogeneity and overlook the
critical importance of simultaneously considering cellular
heterogeneity and gene module information in the model design.
Generally, from the view of underlying principles, we can divide the
methodologies into deep learning methods and traditional statistical
algorithms. Many deep learning-based (e.g., DeepTFni[48]^11 and
DeepSEM[49]^9) methodologies primarily build upon foundational
models.[50]^12^,[51]^13 The frequent oversight in these models is the
inherent relationships between cells and genes, as informed by domain
expertise. This often leads to models that compromise on explainability
and narrow their application scope. For the traditional statistical
algorithms, such as Bayesian networks[52]^14^,[53]^15 and ensemble
methods,[54]^16^,[55]^17^,[56]^18 it can be computationally expensive,
and it remains a challenge to extend these methodologies to encompass
broader nonlinear paradigms.
Additionally, the scRNA-seq data are frequently marred by noise and
incompleteness, attributable to phenomena such as amplification biases
inherent to reverse transcription and PCR amplification
processes,[57]^19^,[58]^20 as well as the issue of low quantities of
nucleic acids in single cells. To get a more robust GRN, several
methodologies[59]^21^,[60]^22 leverage multi-omic datasets, capturing
different kinds of cellular information to enrich the model’s
comprehensiveness. However, integrating multi-omic datasets presents
substantial challenges, particularly regarding harmonizing data from
disparate sources and platforms, and could also introduce additional
noise.[61]^23
To address the problems and construct a reliable GRN, we model
scRNA-seq data as a hypergraph and present hypergraph variational
autoencoder (HyperG-VAE), a Bayesian deep generative model to process
the hypergraph data. Distinct from current approaches, HyperG-VAE
simultaneously captures cellular heterogeneity and gene modules (in GRN
analysis, gene modules refer to clusters of genes that are regulated
together by the same set of TFs) through its cell and gene encoders
individually during the GRN construction. Two encoders employ
variational inference to learn stochastic representations of genes and
cells, offering a more flexible and robust approach to managing
real-world data complexities. This could be particularly effective in
handling noise in scRNA-seq datasets, a capability that has been
demonstrated in previous studies.[62]^24^,[63]^25^,[64]^26 Within a
shared embedding space, the dual encoders of our model interact,
boosting its cohesiveness. The joint optimization manner elucidates
gene regulatory mechanisms within gene modules across various cell
clusters, thereby augmenting the model’s ability to delineate complex
gene regulatory interactions and significantly improving its
explainability.
Our study evaluates the performance of HyperG-VAE in various scRNA-seq
applications. These include (1) GRN inference, (2) cell embedding, (3)
gene embedding, and (4) gene regulation hypergraph construction.
Through benchmark comparisons, encompassing tasks like GRN inference,
data visualization, and single-cell clustering, we establish that
HyperG-VAE outperforms existing state-of-the-art methods. Additionally,
HyperG-VAE demonstrates its utility in elucidating the regulatory
patterns governing B cell development in bone marrow. Our model also
excels in learning gene expression modules and cell clusters, which
connect the gene encoder and cell encoder individually to boost gene
regulatory hypergraph prediction. This integrated functionality of
HyperG-VAE improves our comprehension of single-cell transcriptomic
data, ultimately providing better insights into the realm of GRN
inference.
Results
Framework overview
We introduce HyperG-VAE, a Bayesian deep generative model specifically
designed to address the complex challenge of gene regulation network
inference using scRNA-seq data, which are represented as a hypergraph
([65]Figure 1; [66]STAR Methods). Our HyperG-VAE takes into account the
interplay between gene modules and cellular heterogeneity, allowing for
a more accurate representation of cell-specific regulatory mechanisms.
This interplay could be incorporated into a hypergraph to capture the
nuanced interactions of genes across diverse cellular states.
Figure 1.
[67]Figure 1
[68]Open in a new tab
Overview of HyperG-VAE
(A) HyperG-VAE, which takes the expression value matrix derived from
scRNA-seq data as input. In the provided table, four cells exhibit
expression across fifteen genes, with color gradients indicating
varying gene expression levels (white circles mean no expression).
(B) The colored circles with serial numbers denote distinct genes,
expressed within specific cells, functioning as interlinked nodes.
These nodes are interconnected by a singular hyperedge (small dashed
ellipses) symbolizing the cell (triangle). Together, these nodes and
hyperedges form a hypergraph structure. Node coloration reflects a
composite of gene expression levels of the given gene across cells; for
instance, gene 3 manifests a blend of green and blue hues. The largest
dashed ellipse is the genome shared by all cells.
(C) The neural network architecture of HyperG-VAE, where two encoders
are designed to process the provided input matrix. The cell encoder
uses the structural equation model (SEM) to discern cellular
heterogeneity and form the GRN, while the gene encoder, employing a
hypergraph self-attention mechanism, focuses on gene module analysis.
The decoder subsequently reconstructs the input matrix, leveraging the
shared latent space of both gene and cell embeddings. The inferred gene
regulation hypergraph integrates cellular and gene representations,
drawing on relationships derived from the learned GRN.
(D–G) Downstream tasks that can be pursued by HyperG-VAE include GRN
construction, clustering both cells and genes, and modeling the
interplay between gene modules and cellular heterogeneity. Further
details are provided in the legend, located in the top right corner.
In the context of hypergraphs, we construct the hypergraph by
representing cells as individual hyperedges, with the genes expressed
in each cell serving as the nodes included within those hyperedges
([69]Figures 1A and 1B). Specifically, let
[MATH: HV∈Rm×n<
/mrow> :MATH]
denote the scRNA-seq expression matrix, where m is the number of cells
and n is the number of genes. The incidence matrix
[MATH: M∈{0
,1}m×n
:MATH]
encodes the hypergraph structure: if a gene (node) i is expressed in
cell (hyperedge) j (
[MATH: HijV>0 :MATH]
), then
[MATH: Mij=1 :MATH]
. This construction captures the relationship between cells and their
expressed genes, enabling the sparse scRNA-seq data to be effectively
represented as a hypergraph.
HyperG-VAE incorporates two encoders, a cell encoder and a gene
encoder, enabling it to learn the hypergraph representation
[MATH: HV :MATH]
([70]Figure 1C) with structure
[MATH: M :MATH]
. The cell encoder generates cell representations
[MATH: HE :MATH]
in the form of hypergraph duality, facilitating the embedding of
high-order relations via structural equation layers. GRN construction
([71]Figure 1D) is realized in this structural equation layer through a
learnable causal interaction matrix. In addition, the cell encoder can
adeptly capture the gene regulation process in a cell-specific manner,
elucidating a clearer landscape of cellular heterogeneity
([72]Figure 1E). The gene encoder is specifically designed to process
observed gene representations, denoted as
[MATH: HV :MATH]
. Given that genes within a module generally manifest consistent
expression profiles across cells, we employ a multi-head self-attention
mechanism that is specifically designed for the hypergraph in this
work. This not only discerns varying gene expression levels but also
assigns appropriate weights to the genes expressed in the same cell
during the message-passing phase. Thus, the gene encoder enhances the
model’s ability to understand and integrate the intricate
interdependencies among genes, thereby aiding in the effective
embedding of gene clusters ([73]Figure 1F). Finally, a hypergraph
decoder is utilized to reconstruct the original topology of the
hypergraph ([74]Figure 1G) using the learned latent embedding of genes
and cells. Utilizing the reconstructed hypergraph and the learned
inter-gene relationships, we can also infer a gene regulatory
hypergraph ([75]Figure 1G). This hypergraph encompasses gene regulatory
modules that span across various cell stages.
HyperG-VAE enhances GRN inference by incorporating the above two
encoders to mutually augment each other’s embedding quality
([76]Figure 1C) while preserving the high-order gene relations among
cells, constrained by hypergraph variational evidence lower bound
([77]STAR Methods). Specifically, the cell encoder incorporates a
structure equation model (SEM) on gene co-expression space to infer the
GRNs; the learning of gene modules by the gene encoder aids in the
inference of GRNs since the gene module conceivably incorporates
TF-target regulation patterns. By integrating the embedding of genes
and cells through joint learning, we observe the substantial
performance of downstream tasks ([78]Figures 1D–1G), including the
inference of GRNs, cell clustering, gene clustering, and interplay
characterization between gene modules and cellular heterogeneity, among
others.
HyperG-VAE achieves accurate prediction of GRNs
We evaluate the performance on GRN inference of HyperG-VAE based on the
setting of the BEELINE framework.[79]^27 Our evaluation encompassed
seven scRNA-seq datasets. This includes two cell lines from humans and
five mouse cell lines (more details can be found in the
[80]supplemental information). We evaluate GRN performance using two
metrics: EPR, which assesses the enrichment of true positives among the
top K predicted edges relative to random predictions, and AUPRC, which
measures the area under the precision-recall curve to account for class
imbalance. These metrics are applied across four types of ground-truth
datasets: STRING,[81]^28 non-specific chromatin immunoprecipitation
(ChIP)-seq,[82]^29^,[83]^30^,[84]^31 cell-type-specific
ChIP-seq,[85]^32^,[86]^33^,[87]^34 and loss-/gain-of-function (LOF/GOF)
networks.[88]^34 As recommended by Pratapa et al.,[89]^27 our analysis
for each dataset prioritized the most variable TFs and the top N
most-varying genes, where N is set to 500 and
[MATH: 1,000 :MATH]
. We selected seven state-of-the-art baseline algorithms based on the
evaluation of BEELINE to compare with HyperG-VAE: DeepSEM,[90]^9
GENIE3,[91]^17 PIDC,[92]^10 GRNBoost2,[93]^18 SCODE,[94]^35
ppcor,[95]^7 and SINCERITIES.[96]^36
Overall, HyperG-VAE demonstrates a discernible enhancement in
performance when compared with other baseline methods in terms of both
AUPRC and EPR metrics ([97]Figures 2 and [98]S2). For scaled results of
datasets composed of all significantly varying TFs and the 500
most-varying genes (as shown in [99]Figure 2), HyperG-VAE surpasses the
seven other benchmarked methods in 40 of the 44 (
[MATH: 91% :MATH]
) evaluated conditions. Compared with the second-best method (DeepSEM),
HyperG-VAE enhances the results by at least
[MATH: 10% :MATH]
in 19 out of the 44 benchmarks. Furthermore, in comparison to other
commendable approaches, such as PIDC and GENIE3, our approach
registered significant enhancements. For PIDC, 38 out of 44 instances
showed improvements of over
[MATH: 10% :MATH]
, with 27 surpassing
[MATH: 30% :MATH]
and 22 going beyond
[MATH: 50% :MATH]
. Similarly, with GENIE3, 33 out of 44 instances marked at least a
[MATH: 10% :MATH]
enhancement, 26 surpassed
[MATH: 30% :MATH]
, and an impressive 20 recorded at least a
[MATH: 50% :MATH]
increase. For results of datasets composed of all significantly varying
TFs and the 1000 most-varying genes ([100]Figure S2), HyperG-VAE
achieves the best prediction performance on
[MATH: 84% :MATH]
(
[MATH: 37/44 :MATH]
) of the benchmarks. In comparison to the runner-up method, DeepSEM,
HyperG-VAE outperforms by a margin of at least
[MATH: 10% :MATH]
in 17 of the 44 evaluated benchmarks. Notably, the average enhancement
in EPR stands at
[MATH: 11.35% :MATH]
, while that in AUPRC is
[MATH: 7.16% :MATH]
.
Figure 2.
[101]Figure 2
[102]Open in a new tab
Benchmarks of different GRN inference methods on experimental scRNA-seq
datasets by EPR and AUPRC scores
The performance of HyperG-VAE is contrasted against seven alternative
algorithms across seven datasets. Each dataset comprises all
significantly varying transcription factors (TFs) and the 500
most-varying genes. These evaluations are based on four distinct
ground-truth benchmarks: non-specific ChIP-seq, STRING,
cell-type-specific ChIP-seq, and LOF/GOF. For each figure pair, the
left image depicts the median EPR results, while the right image shows
the median AUPRC outcomes. Results inferior to random predictions are
excluded from the visualizations for clarity. The color scale in each
dataset is normalized between 0 and 1 using a min-max scaling approach.
EPR is defined as the odds ratio of true positives among the top K
predicted edges, where K represents the number of edges in the
ground-truth GRN, compared to random predictions. Similarly, the AUPRC
ratio reflects the odds ratio of the AUPRC value between the model and
random predictions.
With single-cell sequencing data, robustly inferring GRNs from limited
cells is pivotal, especially for capturing rare cellular phenotypes and
transient states.[103]^9^,[104]^11 Here, we explore the fluctuations in
EPR performance and the robustness of HyperG-VAE when confronted with
limited training data ([105]Figure S3A). We constructed mouse embryonic
stem cell (mESC) datasets[106]^37 composed of all significantly varying
TFs and the 500 and 1,000 most-varying genes, respectively, and
evaluated the accuracy based on four unique ground-truth benchmarks by
randomly subsampling single cells following the BEELINE
benchmark.[107]^27 Upon adjusting the number of subsampled single cells
to 400, 300, 200, 100, and 50, we registered average performance
retentions of
[MATH: 94% :MATH]
,
[MATH: 92% :MATH]
,
[MATH: 91% :MATH]
,
[MATH: 80% :MATH]
, and
[MATH: 53% :MATH]
, respectively. Remarkably, when training with cell counts exceeding
100, a robust
[MATH: 79.17% :MATH]
(
[MATH: 19/24 :MATH]
) retained more than
[MATH: 90% :MATH]
of their performance, and for counts greater than 50, a compelling
[MATH: 87.50% :MATH]
(
[MATH: 28/32 :MATH]
) maintained above
[MATH: 80% :MATH]
efficacy. When utilizing cell-type-specific ChIP-seq as the benchmark,
the performance remains notably stable, with an average performance
retention of 93%. Furthermore, when assessed against the other
three ground truths and the training cell count exceeds 50, there is
only a modest decline in efficacy, averaging
[MATH: 88% :MATH]
performance retention in comparison to the median value derived from
all cells. Beyond performance evaluation, we also examined HyperG-VAE’s
scalability with expansive datasets ([108]Figure S3B).
HyperG-VAE reveals the gene regulation patterns of B cell development in bone
marrow
To evaluate HyperG-VAE’s proficiency in elucidating GRNs and to assess
the effectiveness of both cell clustering embedding and gene module
embedding components within HyperG-VAE, we deployed HyperG-VAE on
scRNA-seq data of B cell development in bone marrow[109]^38 (more
details of the data can be found in the [110]STAR Methods and
[111]Table S3), as illustrated in [112]Figure 3. The progression of B
cell development from hematopoietic stem cells follows a sequential yet
adaptable developmental pathway governed by interactions among
environmental stimuli, signaling cascades, and transcriptional
networks.[113]^39 Throughout this developmental trajectory, TFs play a
pivotal role in regulating the cell cycle, differentiation, and
advancement to subsequent developmental stages. These critical
checkpoints encompass the initial commitment to lymphocytic
progenitors, the specification of pre-B cells, progression through
immature stages, entry into the peripheral B cell pool, B cell
maturation, and subsequent differentiation into plasma cells.[114]^40
Each of these regulatory nodes is controlled by complex transcriptional
networks, which, along with sensing and signaling systems, determine
the final outcomes.
Figure 3.
[115]Figure 3
[116]Open in a new tab
GRN prediction by HyperG-VAE across developmental B cell states in bone
marrow
(A) t-distributed stochastic neighbor embedding (t-SNE) visualization
of cell embedding on the bone marrow B cell dataset; the embedding is
learned by the cell encoder of HyperG-VAE. Black lines depict the
trajectory from pre-pro-B cells to mature B cells.
(B) Heatmap/dot plot showing TF expression of the regulon on a color
scale and cell-type specificity (RSS [regulon specificity score]) of
the regulon on a size scale.
(C) The accuracy of GRN prediction by cross-validation with publicly
available ChIP-seq datasets. The overlap coefficient quantifies the
concordance between sets of target genes for each TF, as derived from
GRN prediction and ChIP-seq database, respectively. The x axis
represents the difference value of overlap coefficients between
HyperG-VAE and SCENIC (default). Pink lines indicate superior
performance by HyperG-VAE, while blue lines favor the default SCENIC.
The dot plot illustrates the overlap coefficient of the more effective
approach for each regulon, depicted on a color gradient. Cell states
are arranged in a sequence that reflects the progression of bone marrow
B cell development stages.
(D) The GRN visualization for the bone marrow B cell dataset with ten
states from pre-pro-B state to plasma state, as delineated by
HyperG-VAE; the inner circle shows the co-binding of shared target
genes, while the outer circle presents TF-focused target genes.
HyperG-VAE uncovers the cell embedding by dimensionality reduction and
distinctly segregates the primary cell types across various stages of
bone marrow B cell development ([117]Figure 3A). Significantly,
HyperG-VAE also effectively captures the linear progression of B cell
development, spanning from early pro-B, late pro-B, large pre-B, small
pre-B, and immature B to mature B cells. In our pursuit to unveil the
gene regulation patterns in developmental B cells, our HyperG-VAE, in
conjunction with SCENIC,[118]^41 successfully identifies established
master regulators associated with different developmental stages
([119]Figures 3B and [120]S4), including pre-pro-B (Runx2), pro-B (Ebf1
and Lef1), large pre-B (Myc and Hmgb2), small pre-B (Tcf3 and Sox4),
immature B (Relb and Egr1), mature B (Nfkb2), and plasma (Cebpb and
Prdm1) cells.
Furthermore, we conducted a benchmark assessment to compare the
performance of HyperG-VAE against SCENIC using its default settings.
Using the ChIP-seq database,[121]^33 the accuracy was evaluated based
on the degree of overlap coefficient between the ChIP-seq coverage and
the predicted target genes from both methods. Our HyperG-VAE, when
combined with SCENIC, demonstrates superior performance compared to the
standard SCENIC approach, exhibiting higher accuracy in detecting
TF-target patterns for the key TFs (as illustrated in [122]Figure 3C).
The comprehensive gene regulation network spanning the developmental B
cells in the bone marrow is depicted in [123]Figure 3D. We find that
the GRNs show TF-target regulation patterns in two ways: TFs co-binding
to shared predicted enhancers (the inner circle in [124]Figure 3D) and
TF-specific target genes (the outer circle in [125]Figure 3D). We also
observe that the cooperativity between TFs is stronger within cell
types along the development path, indicating that some TFs are involved
in multiple stages of B cell development.
Gene expression module learning enhances HyperG-VAE in GRN inference
Our HyperG-VAE model augments the GRN prediction by integrating gene
space learning, as depicted in [126]Figure 4C. HyperG-VAE uncovers the
gene expression modules visualized by uniform manifold approximation
and projection (UMAP)[127]^42 in [128]Figure 4A. By associating these
gene modules with the key TFs and corresponding target genes of
pathways along B cell development, we annotate the gene modules with
specific cell types, indicating that these gene clusters are activated
in different stages of developmental B cells ([129]Figures 4A and 4B).
Figure 4.
[130]Figure 4
[131]Open in a new tab
The interplay between gene embedding and cell clusters
(A) Gene embedding by the gene encoder of HyperG-VAE on developmental B
cell data. Gene clusters encoded by numbers are associated with
different cell types by colors.
(B) The heatmap illustrates normalized overlap values between gene
clusters and TF regulons from different B cell states. Here, genes
serve as a bridge to compute the overlap, with lighter colors
representing larger overlap scores.
(C) t-SNE visualization of cellular embeddings with highlighted
pre-BCRi B cell state, together with the associated regulon, Phf8_(+),
and related target genes.
(D) Pathway enrichment analysis on gene cluster 5 with associated
molecular complex detection (MCODE) network components.
(E) Pathway enrichment analysis of different gene clusters. The shaded
pathways show the dominant gene programs for each gene cluster.
We further apply gene set enrichment analysis (GSEA)[132]^43 ([133]STAR
Methods) to investigate the gene clusters ([134]Figures 4E and [135]S1;
[136]Tables S5 and [137]S6). The pathways identified through GSEA
validate the accuracy of our gene cluster annotations. For example,
large pre-B cells (cluster pre-BCRi [B cell receptor independent] B)
are associated with signals initiating diverse processes, which include
proliferation and recombination of the light-chain gene[138]^44; the
GSEA results show the related pathways: lymphocyte proliferation, cell
activation, and B cell receptor signaling pathway. Immature B cells
exhibit B cell central tolerance, which is governed by mechanisms such
as receptor editing and apoptosis.[139]^45 The pathways identified in
the corresponding gene clusters include antigen processing and
presentation of exogenous peptide antigen, DNA damage response,
regulation of cell killing, and apoptotic signaling pathways. Plasma
cells are terminally differentiated B lymphocytes that secrete
immunoglobulins, also known as antibodies.[140]^46 Considering the
substantial demands placed on these cells for secretory biological
processes, the pathways associated with the relevant gene cluster shed
light on the cellular response to endoplasmic reticulum stress.
We show that the gene modules are associated with different biological
pathways during B cell development in the bone marrow. These gene
modules implicitly incorporate the gene regulation patterns, leading to
different cell types. On the other hand, distinct cell types of B cell
clustering are engaged in various immunological
environments,[141]^39^,[142]^47 resulting in different signaling
pathways for B cell activation and fate decisions. We exemplify this
joint relationship with an example involving B cells at the large pre-B
stage, as shown in [143]Figures 4C and 4D. This specific cell state
([144]Figure 4C) is characterized by gene regulation patterns
associated with cell proliferation, reflected by the regulon
Phf8(+).[145]^48 The corresponding gene cluster ([146]Figure 4D) is
linked to a molecular complex detection (MCODE) network, which belongs
to the lymphocyte proliferation pathway ([147]Figure 4E) and shares
target genes with the Phf8(+) regulon.
Therefore, our HyperG-VAE reciprocally integrates these two concepts,
cell clustering and gene module detection, with the aim of revealing
GRNs. Concretely, the cell embedding process groups together similar
cells that share common pathways, while the gene modules aggregate
genes exhibiting similar regulation patterns, thereby enhancing the
accuracy of GRN computations.
HyperG-VAE constructs the cell-type-specific GRN on B cell development in
bone marrow
We have demonstrated that gene modules associated with various
biological pathways correspond to distinct cell types within bone
marrow development in B cells. Essentially, these distinct gene
regulation patterns influence cell fate commitment, leading to the
development of diverse cell types with varying gene regulation
profiles. Thus, we employ HyperG-VAE to investigate each individual
state of developmental B cells and construct a more accurate GRN for B
cells at specific developmental stages, as illustrated in
[148]Figure 5. B cell development in the bone marrow can be broadly
categorized into four states: pro-B, large pre-B, small pre-B, and
mature B.[149]^40 These four stages are visualized using UMAP, as
depicted in [150]Figure 5B. For each of these states, we employed
HyperG-VAE to compute GRNs and uncover the predominant regulatory
patterns, as illustrated in [151]Figures 5A–5C. HyperG-VAE effectively
reveals the key TFs and their associated target genes within each cell
state. For example, in the pro-B state, Ebf1[152]^49 and Pax5[153]^50
play significant roles, while Myc[154]^38 stands out in the large pre-B
state; Bach2[155]^38^,[156]^51 is crucial in the small pre-B state, and
Klf2[157]^52 and Ctcf[158]^53 are notable in the mature state.
Figure 5.
[159]Figure 5
[160]Open in a new tab
Cell-type-specific GRN analysis of developmental B cells in bone marrow
(A) The Sankey diagram shows significant regulons and corresponding
target genes of different states along B cell development in bone
barrow, with the normalized enrichment score (NES) encoded by color
shade and the area under the curve (AUC) score by dot size.
(B) Gene regulatory hypergraph at the cell clustering level,
illustrating the four principal B cell states as four hyperedges.
Conserved TFs are highlighted with red dots, and target genes are
depicted as diamonds, where size reflects the log fold change (logFC)
in gene expression of a given state compared to others. Highly
expressed genes are labeled in the figure.
(C) The motif of significant TFs along the principal stages. Additional
motif details can be found in [161]Figure S5.
(D) Heatmap displays the expression of the top genes, ranked by logFC,
across cells classified into four distinct cell states. The genes are
selected by the overlap of top logFC genes and predicted target genes.
The genes’ color corresponds to the cell states in which the regulation
pattern is predicted.
(E) Volcano plot of differentially expressed genes of different states.
The blue inverted triangles denote downregulated genes, and the red
triangles denote upregulated genes.
The aforementioned TFs, along with their respective target genes,
collectively constitute the regulons that characterize the four major
cell states, allowing for the construction of a gene regulatory
hypergraph at the cell clustering level ([162]Figures 5A and 5B). For
each major state, we overlap the top-predicted target genes by
HyperG-VAE ([163]Figure 5B) with the differentially expressed genes
(DEGs; [164]Figure 5E) and identify the principal marker genes
([165]Figure 5D). Specifically, Ebf1 and Pax5 are essential in the
pro-B state of bone marrow to maintain an early B cell phenotype
characterized by the expression of B cell-specific genes such as Vpreb
and Igll1 for surrogate light-chain production.[166]^49^,[167]^50 In
the large pre-B stage, the enriched regulons encompass the TF
Myc[168]^38 and other genes related to the cell cycle, such as Mki67,
Cenpf, Cenpa, and Hmgb2. Additionally, nucleosome-related genes, such
as Hist1h2ae and Hist1h3c, are also enriched in this state due to the
high rate of cell proliferation. In the small pre-B stage, both Bach2
and Btg1 restrain cell proliferation.[169]^54^,[170]^55 It is
noteworthy that the mature state markers H2-Ab1, H2-Eb1, H2-Aa, and
Cd74 are assigned as target genes in the pro-B stage, suggesting that
these genes may be actively repressed in the early B cell development
stage.
HyperG-VAE addresses cellular heterogeneity and learns the cell
representation
Cellular heterogeneity is a hallmark of complex biological systems,
manifesting as diverse cell types and states within scRNA-seq
datasets.[171]^56 We hypothesize that the latent space inferred by the
cell encoder of HyperG-VAE captures this biological variability among
cells. Leveraging domain expertise, we can map these clusters to known
cell types or states, ensuring that the computational predictions align
with manual inspection and annotation. To evaluate the performance, we
applied HyperG-VAE to three biologically relevant scRNA-seq datasets,
including an Alzheimer’s disease (AD) dataset,[172]^57 a colorectal
cancer dataset,[173]^58 and the widely used mouse brain dataset, known
as the Zeisel dataset[174]^59 (more dataset details can be found in
[175]Table S4). To benchmark HyperG-VAE, we also compared its
low-dimensional embeddings with those of six other algorithms:
autoCell,[176]^60 DCA,[177]^61 scVI,[178]^62 DESC,[179]^63
SAUCIE,[180]^64 and scVAE.[181]^65 We followed the Louvain
algorithm[182]^66 to cluster all the single cells into an identical
number of clusters for each method ([183]STAR Methods). To assess the
precision of clustering against established reference labels, we
employed four metrics: the adjusted Rand index (ARI), normalized mutual
information (NMI), homogeneity (HOM), and completeness (COM). These
metrics span a scale from 0, indicating random clustering, to 1,
signifying perfect alignment with reference clusters, with superior
values indicating enhanced accuracy.
Overall, the performance of HyperG-VAE surpasses that of its
counterparts, as evidenced in [184]Figure 6A. Specifically, for the
Zeisel dataset, the clusters generated using HyperG-VAE align more
closely with the existing cell type annotations, registering an NMI of
[MATH: 83.1% :MATH]
and an ARI of
[MATH: 83.7% :MATH]
. In comparison, the next best-performing algorithm, autoCell, recorded
an NMI of
[MATH: 78.0% :MATH]
and an ARI of
[MATH: 80.6% :MATH]
. Furthermore, we evaluated HyperG-VAE’s latent space to determine its
ability to capture the biological diversity among individual cells in
the Zeisel dataset, as illustrated in [185]Figure 6B. We visualized the
data embedding by UMAP. In previous sections, we showed that HyperG-VAE
effectively identifies gene expression differences within cells of the
same type. The compact UMAP highlights distinct clusters while
preserving intra-cluster heterogeneity, demonstrating its ability to
capture both inter- and intra-cluster variability. Compared to other
algorithms, the distinct separation observed with HyperG-VAE across
most clusters indicates effective clustering, suggesting that
HyperG-VAE’s cell encoder adeptly distinguishes between various cell
states or types. While algorithms such as autoCell, scVI, and scVAE
have achieved results that are comparable, the differentiation between
their clusters is not as pronounced as with HyperG-VAE. For the
remaining algorithms, the substantial overlap among clusters hinders
the classifier from producing optimal results. Specifically, compared
to other methodologies founded on conventional single-layer VAEs, the
enhanced visualization capabilities of HyperG-VAE underscore the
potential benefits of incorporating gene modules in cell embedding
processes.
Figure 6.
[186]Figure 6
[187]Open in a new tab
Benchmarks of single-cell clustering and embedding
(A) The cell clustering performance of HyperG-VAE on the single-cell
datasets compared with six baseline methods on four key metrics: NMI,
ARI, COM, and HOM. NMI, normalized mutual information (the higher the
value, the better); ARI, adjusted rand index (the higher the value, the
better); COM, completeness (the higher the value, the better); HOM,
homogeneity (the higher the value, the better).
(B) UMAP visualization of latent representations on the Zeisel dataset
for different methods. The black circles highlight areas with ambiguous
classification boundaries.
Discussion
In this work, we introduce HyperG-VAE, a sophisticated model designed
for the construction of GRNs. Uniquely, HyperG-VAE leverages a
hypergraph framework, wherein genes expressed within individual cells
are represented as nodes connected by distinct hyperedges, capturing
the latent gene correlations among single cells. As a key algorithmic
innovation of HyperG-VAE, the transformation of scRNA-seq data into a
hypergraph offers unique advantages compared to existing GRN inference
methods. These advantages include improved modeling of cellular
heterogeneity, enhanced analysis of gene modules, increased sensitivity
to gene correlations among cells, and improved visualization and
interpretation of GRNs. This direct use of a hypergraph, as opposed to
traditional pairwise methods like star expansion (SE) and clique
expansion (CE),[188]^67 captures complex multi-dimensional
relationships more effectively, avoiding the increased complexity and
information loss associated with SE and CE. By maintaining the
hypergraph’s original form, HyperG-VAE preserves the data’s full
complexity and integrity, enhancing analytical depth and reducing
computational demands.
In addition to modeling scRNA-seq data in a hypergraph, HyperG-VAE
effectively integrates gene modules and cellular heterogeneity,
demonstrating superior performance compared to existing methods. On the
one hand, our study reveals that HyperG-VAE outperforms related
existing state-of-the-art algorithms in GRN inference, cell type
classification, and visualization tasks, respectively, as evidenced by
its enhanced performance across several widely recognized benchmark
datasets. On the other hand, we also utilize HyperG-VAE on scRNA-seq
data of B cell development in bone marrow[189]^38 to evaluate its
performance in a biologically relevant context. Firstly, HyperG-VAE
achieves accurate prediction of GRNs and successfully identifies key
master regulators and target genes across different developmental
stages. Meanwhile, we cross-validated our results with publicly
available ChIP-seq datasets,[190]^33 further demonstrating HyperG-VAE’s
robust performance in predicting regulons based on GRN inference.
Secondly, subsequent evaluations across various tasks further
highlighted the effectiveness of HyperG-VAE’s carefully designed
encoder components, with their synergistic interaction significantly
bolstering the model’s overall performance. Specifically, the cell
encoder within HyperG-VAE predicts the GRNs through a structural
equation model while also pinpointing unique cell clusters and tracing
the developmental lineage of B cells; the gene encoder uncovers gene
modules that implicitly encapsulate patterns of gene regulation,
thereby enhancing the accuracy of GRN predictions. To demonstrate this
interaction, we highlight the shared genes between gene clusters and
the predicted target genes within cell clusters. These shared genes are
notably present in pathways identified by GSEA, signifying the
connections between gene modules identified by gene encoders and cell
clusters delineated by cell encoders.
Our proposed model, HyperG-VAE, holds promise as a foundational
framework, adaptable to a multitude of biological contexts in future
research endeavors. A promising future direction is extending
HyperG-VAE into a heterogeneous hypergraph VAE by incorporating
additional omics data, such as single-cell ChIP-seq. Integrating
single-cell ChIP-seq data into HyperG-VAE would enhance GRN
construction, complementing transcriptomic data and revealing upstream
regulatory events. This kind of integration would enable seamless
multi-omics data fusion and further advance GRN inference and cellular
regulation analysis. Additionally, while the present model does not
explicitly use metadata for genes and cells, future enhancements that
integrate these metadata into the hypergraph-centric framework could
significantly improve the representations of nodes (genes) and
hyperedges (cells). The weights assigned to these hyperedges can also
be factored into the model’s learning phase, offering a more
comprehensive analysis. In the generative phase of HyperG-VAE,
gene-cell interactions proceed through a cohesive mechanism,
facilitating the development of a robust GRN underscored by the
interplay between gene modules and cell clusters. Moreover, advancing
to a single-cell-level, fine-grained gene-coexpression hypergraph study
could further enhance our understanding of single-cell dataset
analysis. Furthermore, subsequent research could explore the dynamic
construction of temporal GRNs on chronological single-cell data,
drawing upon the foundational principle of simultaneously considering
cellular heterogeneity and gene modules, as demonstrated in this work.
Overall, HyperG-VAE provides a competitive solution for GRN
construction and related downstream works. By combining cell-specific
GRN inference, hypergraph-based gene module identification, and
integrated cell-gene latent representations, HyperG-VAE delivers
biologically relevant insights that extend beyond traditional GRN
methods. It provides researchers with powerful tools to explore
cell-specific regulatory mechanisms, identify gene modules, and
generate testable predictions, thereby advancing our understanding of
complex biological systems.
Limitations of the study
Our HyperG-VAE leveraging the self-attention mechanism has undeniably
propelled models to achieve remarkable
performance.[191]^68^,[192]^69^,[193]^70^,[194]^71 However, despite its
prowess, self-attention-based models still have inherent limitations.
Specifically, the self-attention’s quadratic complexity concerning
sequence length presents challenges. For sequences of length N, it
necessitates
[MATH: O(N2) :MATH]
computations, rendering it computationally demanding and memory
inefficient, especially for longer sequences. Future efforts to address
this limitation will explore adapting the techniques of attention
matrix sparse factorization and positive orthogonal random features, as
demonstrated in studies,[195]^72^,[196]^73 to ease computational
demands.
Resource availability
Lead contact
Requests for resources of this article and any additional information
should be directed to the lead contact, Wenjie Zhang
(wenjie.zhang@unsw.edu.au).
Materials availability
This study did not generate new unique reagents.
Data and code availability
* •
All datasets used in the work have been summarized in the [197]key
resources table.
* •
All original code has been deposited at
[198]https://github.com/guangxinsuuu/HyperG-VAE and is publicly
available at [199]https://doi.org/10.5281/zenodo.15028720 as of the
date of publication.
* •
Any additional information required to reanalyze the data reported
in this paper is available from the [200]lead contact upon request.
Acknowledgments