Graphical abstract

   graphic file with name fx1.jpg
   [37]Open in a new tab

Highlights

     * •
       Hypergraph modeling enhances scRNA-seq representation by reducing
       sparsity
     * •
       Capturing cellular heterogeneity and gene modules uncovers complex
       GRN relationships
     * •
       HyperG-VAE surpasses benchmarks in predicting GRNs and identifying
       key regulators
     * •
       Tested on B cells, HyperG-VAE excels in gene regulation,
       clustering, and lineage tracing

Motivation

   Gene regulatory networks (GRNs) derived from single-cell RNA sequencing
   (scRNA-seq) data provide insights into the complex interactions between
   transcription factors (TFs) and target genes. This capability enables a
   detailed understanding of gene expression regulation and cellular
   function across diverse populations. However, addressing both cellular
   heterogeneity and gene modules remains a significant challenge, as
   current GRN inference methods struggle to bridge this gap. To address
   this limitation, we introduce hypergraph variational autoencoder
   (HyperG-VAE), an algorithm based on hypergraph representation learning
   that captures latent correlations among genes and cells, enhancing the
   imputation of contact maps.
     __________________________________________________________________

   Su et al. present HyperG-VAE, a hypergraph-based model integrating
   cellular heterogeneity and gene modules for robust GRN inference from
   scRNA-seq data. By capturing latent gene-cell correlations, it
   overcomes sparsity, outperforms existing methods, and reveals key gene
   regulation patterns with robust downstream analysis.

Introduction

   Gene regulatory networks (GRNs) within single-cell RNA sequencing
   (scRNA-seq) datasets present a sophisticated interplay of transcription
   factors (TFs) and target genes, uniquely capturing the modulation of
   gene expression and thereby delineating the intricate cellular
   functions and responses within diverse cell populations.[38]^1 GRNs
   illuminate core biological processes and underpin applications from
   disease modeling to therapeutic design,[39]^2^,[40]^3^,[41]^4
   empowering researchers to interpret the mechanisms of gene interactions
   within cells and leverage this understanding for medical and
   biotechnological innovations.[42]^5^,[43]^6

   Numerous methodologies have emerged for inferring GRNs from single-cell
   transcriptomic data. The algorithms emphasize co-expression networks in
   a statistical way (e.g., PPCOR[44]^7 and LocaTE[45]^8) or aim to
   decipher causal relationships between TFs and their target genes based
   on the analysis of the gene interactions among cells (e.g.,
   DeepSEM[46]^9 and PIDC[47]^10). Despite their achievements, these
   algorithms still have inherent limitations. Specifically, these
   approaches mainly focus on cellular heterogeneity and overlook the
   critical importance of simultaneously considering cellular
   heterogeneity and gene module information in the model design.
   Generally, from the view of underlying principles, we can divide the
   methodologies into deep learning methods and traditional statistical
   algorithms. Many deep learning-based (e.g., DeepTFni[48]^11 and
   DeepSEM[49]^9) methodologies primarily build upon foundational
   models.[50]^12^,[51]^13 The frequent oversight in these models is the
   inherent relationships between cells and genes, as informed by domain
   expertise. This often leads to models that compromise on explainability
   and narrow their application scope. For the traditional statistical
   algorithms, such as Bayesian networks[52]^14^,[53]^15 and ensemble
   methods,[54]^16^,[55]^17^,[56]^18 it can be computationally expensive,
   and it remains a challenge to extend these methodologies to encompass
   broader nonlinear paradigms.

   Additionally, the scRNA-seq data are frequently marred by noise and
   incompleteness, attributable to phenomena such as amplification biases
   inherent to reverse transcription and PCR amplification
   processes,[57]^19^,[58]^20 as well as the issue of low quantities of
   nucleic acids in single cells. To get a more robust GRN, several
   methodologies[59]^21^,[60]^22 leverage multi-omic datasets, capturing
   different kinds of cellular information to enrich the model’s
   comprehensiveness. However, integrating multi-omic datasets presents
   substantial challenges, particularly regarding harmonizing data from
   disparate sources and platforms, and could also introduce additional
   noise.[61]^23

   To address the problems and construct a reliable GRN, we model
   scRNA-seq data as a hypergraph and present hypergraph variational
   autoencoder (HyperG-VAE), a Bayesian deep generative model to process
   the hypergraph data. Distinct from current approaches, HyperG-VAE
   simultaneously captures cellular heterogeneity and gene modules (in GRN
   analysis, gene modules refer to clusters of genes that are regulated
   together by the same set of TFs) through its cell and gene encoders
   individually during the GRN construction. Two encoders employ
   variational inference to learn stochastic representations of genes and
   cells, offering a more flexible and robust approach to managing
   real-world data complexities. This could be particularly effective in
   handling noise in scRNA-seq datasets, a capability that has been
   demonstrated in previous studies.[62]^24^,[63]^25^,[64]^26 Within a
   shared embedding space, the dual encoders of our model interact,
   boosting its cohesiveness. The joint optimization manner elucidates
   gene regulatory mechanisms within gene modules across various cell
   clusters, thereby augmenting the model’s ability to delineate complex
   gene regulatory interactions and significantly improving its
   explainability.

   Our study evaluates the performance of HyperG-VAE in various scRNA-seq
   applications. These include (1) GRN inference, (2) cell embedding, (3)
   gene embedding, and (4) gene regulation hypergraph construction.
   Through benchmark comparisons, encompassing tasks like GRN inference,
   data visualization, and single-cell clustering, we establish that
   HyperG-VAE outperforms existing state-of-the-art methods. Additionally,
   HyperG-VAE demonstrates its utility in elucidating the regulatory
   patterns governing B cell development in bone marrow. Our model also
   excels in learning gene expression modules and cell clusters, which
   connect the gene encoder and cell encoder individually to boost gene
   regulatory hypergraph prediction. This integrated functionality of
   HyperG-VAE improves our comprehension of single-cell transcriptomic
   data, ultimately providing better insights into the realm of GRN
   inference.

Results

Framework overview

   We introduce HyperG-VAE, a Bayesian deep generative model specifically
   designed to address the complex challenge of gene regulation network
   inference using scRNA-seq data, which are represented as a hypergraph
   ([65]Figure 1; [66]STAR Methods). Our HyperG-VAE takes into account the
   interplay between gene modules and cellular heterogeneity, allowing for
   a more accurate representation of cell-specific regulatory mechanisms.
   This interplay could be incorporated into a hypergraph to capture the
   nuanced interactions of genes across diverse cellular states.

Figure 1.

   [67]Figure 1
   [68]Open in a new tab

   Overview of HyperG-VAE

   (A) HyperG-VAE, which takes the expression value matrix derived from
   scRNA-seq data as input. In the provided table, four cells exhibit
   expression across fifteen genes, with color gradients indicating
   varying gene expression levels (white circles mean no expression).

   (B) The colored circles with serial numbers denote distinct genes,
   expressed within specific cells, functioning as interlinked nodes.
   These nodes are interconnected by a singular hyperedge (small dashed
   ellipses) symbolizing the cell (triangle). Together, these nodes and
   hyperedges form a hypergraph structure. Node coloration reflects a
   composite of gene expression levels of the given gene across cells; for
   instance, gene 3 manifests a blend of green and blue hues. The largest
   dashed ellipse is the genome shared by all cells.

   (C) The neural network architecture of HyperG-VAE, where two encoders
   are designed to process the provided input matrix. The cell encoder
   uses the structural equation model (SEM) to discern cellular
   heterogeneity and form the GRN, while the gene encoder, employing a
   hypergraph self-attention mechanism, focuses on gene module analysis.
   The decoder subsequently reconstructs the input matrix, leveraging the
   shared latent space of both gene and cell embeddings. The inferred gene
   regulation hypergraph integrates cellular and gene representations,
   drawing on relationships derived from the learned GRN.

   (D–G) Downstream tasks that can be pursued by HyperG-VAE include GRN
   construction, clustering both cells and genes, and modeling the
   interplay between gene modules and cellular heterogeneity. Further
   details are provided in the legend, located in the top right corner.

   In the context of hypergraphs, we construct the hypergraph by
   representing cells as individual hyperedges, with the genes expressed
   in each cell serving as the nodes included within those hyperedges
   ([69]Figures 1A and 1B). Specifically, let
   [MATH: <mrow><msup><mi mathvariant="bold-italic">H</mi><mi
   mathvariant="bold-script">V</mi></msup><mo linebreak="goodbreak"
   linebreakstyle="after">∈</mo><msup><mi
   mathvariant="double-struck">R</mi><mrow><mi>m</mi><mo>×</mo><mi>n</mi><
   /mrow></msup></mrow> :MATH]
   denote the scRNA-seq expression matrix, where m is the number of cells
   and n is the number of genes. The incidence matrix
   [MATH: <mrow><mi mathvariant="bold-italic">M</mi><mo
   linebreak="goodbreak"
   linebreakstyle="after">∈</mo><msup><mrow><mo>{</mo><mrow><mn>0</mn><mo>
   ,</mo><mn>1</mn></mrow><mo>}</mo></mrow><mrow><mi>m</mi><mo>×</mo><mi>n
   </mi></mrow></msup></mrow> :MATH]
   encodes the hypergraph structure: if a gene (node) i is expressed in
   cell (hyperedge) j (
   [MATH: <mrow><msubsup><mi
   mathvariant="bold-italic">H</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi
   mathvariant="bold-script">V</mi></msubsup><mo linebreak="goodbreak"
   linebreakstyle="after">></mo><mn>0</mn></mrow> :MATH]
   ), then
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">M</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub
   ><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>1</mn></mrow> :MATH]
   . This construction captures the relationship between cells and their
   expressed genes, enabling the sparse scRNA-seq data to be effectively
   represented as a hypergraph.

   HyperG-VAE incorporates two encoders, a cell encoder and a gene
   encoder, enabling it to learn the hypergraph representation
   [MATH: <mrow><msup><mi mathvariant="bold-italic">H</mi><mi
   mathvariant="bold-script">V</mi></msup></mrow> :MATH]
   ([70]Figure 1C) with structure
   [MATH: <mrow><mi mathvariant="bold-italic">M</mi></mrow> :MATH]
   . The cell encoder generates cell representations
   [MATH: <mrow><msup><mi mathvariant="bold-italic">H</mi><mi
   mathvariant="bold-script">E</mi></msup></mrow> :MATH]
   in the form of hypergraph duality, facilitating the embedding of
   high-order relations via structural equation layers. GRN construction
   ([71]Figure 1D) is realized in this structural equation layer through a
   learnable causal interaction matrix. In addition, the cell encoder can
   adeptly capture the gene regulation process in a cell-specific manner,
   elucidating a clearer landscape of cellular heterogeneity
   ([72]Figure 1E). The gene encoder is specifically designed to process
   observed gene representations, denoted as
   [MATH: <mrow><msup><mi mathvariant="bold-italic">H</mi><mi
   mathvariant="bold-script">V</mi></msup></mrow> :MATH]
   . Given that genes within a module generally manifest consistent
   expression profiles across cells, we employ a multi-head self-attention
   mechanism that is specifically designed for the hypergraph in this
   work. This not only discerns varying gene expression levels but also
   assigns appropriate weights to the genes expressed in the same cell
   during the message-passing phase. Thus, the gene encoder enhances the
   model’s ability to understand and integrate the intricate
   interdependencies among genes, thereby aiding in the effective
   embedding of gene clusters ([73]Figure 1F). Finally, a hypergraph
   decoder is utilized to reconstruct the original topology of the
   hypergraph ([74]Figure 1G) using the learned latent embedding of genes
   and cells. Utilizing the reconstructed hypergraph and the learned
   inter-gene relationships, we can also infer a gene regulatory
   hypergraph ([75]Figure 1G). This hypergraph encompasses gene regulatory
   modules that span across various cell stages.

   HyperG-VAE enhances GRN inference by incorporating the above two
   encoders to mutually augment each other’s embedding quality
   ([76]Figure 1C) while preserving the high-order gene relations among
   cells, constrained by hypergraph variational evidence lower bound
   ([77]STAR Methods). Specifically, the cell encoder incorporates a
   structure equation model (SEM) on gene co-expression space to infer the
   GRNs; the learning of gene modules by the gene encoder aids in the
   inference of GRNs since the gene module conceivably incorporates
   TF-target regulation patterns. By integrating the embedding of genes
   and cells through joint learning, we observe the substantial
   performance of downstream tasks ([78]Figures 1D–1G), including the
   inference of GRNs, cell clustering, gene clustering, and interplay
   characterization between gene modules and cellular heterogeneity, among
   others.

HyperG-VAE achieves accurate prediction of GRNs

   We evaluate the performance on GRN inference of HyperG-VAE based on the
   setting of the BEELINE framework.[79]^27 Our evaluation encompassed
   seven scRNA-seq datasets. This includes two cell lines from humans and
   five mouse cell lines (more details can be found in the
   [80]supplemental information). We evaluate GRN performance using two
   metrics: EPR, which assesses the enrichment of true positives among the
   top K predicted edges relative to random predictions, and AUPRC, which
   measures the area under the precision-recall curve to account for class
   imbalance. These metrics are applied across four types of ground-truth
   datasets: STRING,[81]^28 non-specific chromatin immunoprecipitation
   (ChIP)-seq,[82]^29^,[83]^30^,[84]^31 cell-type-specific
   ChIP-seq,[85]^32^,[86]^33^,[87]^34 and loss-/gain-of-function (LOF/GOF)
   networks.[88]^34 As recommended by Pratapa et al.,[89]^27 our analysis
   for each dataset prioritized the most variable TFs and the top N
   most-varying genes, where N is set to 500 and
   [MATH: <mrow><mn>1</mn><mo>,</mo><mn>000</mn></mrow> :MATH]
   . We selected seven state-of-the-art baseline algorithms based on the
   evaluation of BEELINE to compare with HyperG-VAE: DeepSEM,[90]^9
   GENIE3,[91]^17 PIDC,[92]^10 GRNBoost2,[93]^18 SCODE,[94]^35
   ppcor,[95]^7 and SINCERITIES.[96]^36

   Overall, HyperG-VAE demonstrates a discernible enhancement in
   performance when compared with other baseline methods in terms of both
   AUPRC and EPR metrics ([97]Figures 2 and [98]S2). For scaled results of
   datasets composed of all significantly varying TFs and the 500
   most-varying genes (as shown in [99]Figure 2), HyperG-VAE surpasses the
   seven other benchmarked methods in 40 of the 44 (
   [MATH: <mrow><mn>91</mn><mo>%</mo></mrow> :MATH]
   ) evaluated conditions. Compared with the second-best method (DeepSEM),
   HyperG-VAE enhances the results by at least
   [MATH: <mrow><mn>10</mn><mo>%</mo></mrow> :MATH]
   in 19 out of the 44 benchmarks. Furthermore, in comparison to other
   commendable approaches, such as PIDC and GENIE3, our approach
   registered significant enhancements. For PIDC, 38 out of 44 instances
   showed improvements of over
   [MATH: <mrow><mn>10</mn><mo>%</mo></mrow> :MATH]
   , with 27 surpassing
   [MATH: <mrow><mn>30</mn><mo>%</mo></mrow> :MATH]
   and 22 going beyond
   [MATH: <mrow><mn>50</mn><mo>%</mo></mrow> :MATH]
   . Similarly, with GENIE3, 33 out of 44 instances marked at least a
   [MATH: <mrow><mn>10</mn><mo>%</mo></mrow> :MATH]
   enhancement, 26 surpassed
   [MATH: <mrow><mn>30</mn><mo>%</mo></mrow> :MATH]
   , and an impressive 20 recorded at least a
   [MATH: <mrow><mn>50</mn><mo>%</mo></mrow> :MATH]
   increase. For results of datasets composed of all significantly varying
   TFs and the 1000 most-varying genes ([100]Figure S2), HyperG-VAE
   achieves the best prediction performance on
   [MATH: <mrow><mn>84</mn><mo>%</mo></mrow> :MATH]
   (
   [MATH: <mrow><mn>37</mn><mo linebreak="goodbreak"
   linebreakstyle="after">/</mo><mn>44</mn></mrow> :MATH]
   ) of the benchmarks. In comparison to the runner-up method, DeepSEM,
   HyperG-VAE outperforms by a margin of at least
   [MATH: <mrow><mn>10</mn><mo>%</mo></mrow> :MATH]
   in 17 of the 44 evaluated benchmarks. Notably, the average enhancement
   in EPR stands at
   [MATH: <mrow><mn>11.35</mn><mo>%</mo></mrow> :MATH]
   , while that in AUPRC is
   [MATH: <mrow><mn>7.16</mn><mo>%</mo></mrow> :MATH]
   .

Figure 2.

   [101]Figure 2
   [102]Open in a new tab

   Benchmarks of different GRN inference methods on experimental scRNA-seq
   datasets by EPR and AUPRC scores

   The performance of HyperG-VAE is contrasted against seven alternative
   algorithms across seven datasets. Each dataset comprises all
   significantly varying transcription factors (TFs) and the 500
   most-varying genes. These evaluations are based on four distinct
   ground-truth benchmarks: non-specific ChIP-seq, STRING,
   cell-type-specific ChIP-seq, and LOF/GOF. For each figure pair, the
   left image depicts the median EPR results, while the right image shows
   the median AUPRC outcomes. Results inferior to random predictions are
   excluded from the visualizations for clarity. The color scale in each
   dataset is normalized between 0 and 1 using a min-max scaling approach.
   EPR is defined as the odds ratio of true positives among the top K
   predicted edges, where K represents the number of edges in the
   ground-truth GRN, compared to random predictions. Similarly, the AUPRC
   ratio reflects the odds ratio of the AUPRC value between the model and
   random predictions.

   With single-cell sequencing data, robustly inferring GRNs from limited
   cells is pivotal, especially for capturing rare cellular phenotypes and
   transient states.[103]^9^,[104]^11 Here, we explore the fluctuations in
   EPR performance and the robustness of HyperG-VAE when confronted with
   limited training data ([105]Figure S3A). We constructed mouse embryonic
   stem cell (mESC) datasets[106]^37 composed of all significantly varying
   TFs and the 500 and 1,000 most-varying genes, respectively, and
   evaluated the accuracy based on four unique ground-truth benchmarks by
   randomly subsampling single cells following the BEELINE
   benchmark.[107]^27 Upon adjusting the number of subsampled single cells
   to 400, 300, 200, 100, and 50, we registered average performance
   retentions of
   [MATH: <mrow><mn>94</mn><mo>%</mo></mrow> :MATH]
   ,
   [MATH: <mrow><mn>92</mn><mo>%</mo></mrow> :MATH]
   ,
   [MATH: <mrow><mn>91</mn><mo>%</mo></mrow> :MATH]
   ,
   [MATH: <mrow><mn>80</mn><mo>%</mo></mrow> :MATH]
   , and
   [MATH: <mrow><mn>53</mn><mo>%</mo></mrow> :MATH]
   , respectively. Remarkably, when training with cell counts exceeding
   100, a robust
   [MATH: <mrow><mn>79.17</mn><mo>%</mo></mrow> :MATH]
   (
   [MATH: <mrow><mn>19</mn><mo linebreak="goodbreak"
   linebreakstyle="after">/</mo><mn>24</mn></mrow> :MATH]
   ) retained more than
   [MATH: <mrow><mn>90</mn><mo>%</mo></mrow> :MATH]
   of their performance, and for counts greater than 50, a compelling
   [MATH: <mrow><mn>87.50</mn><mo>%</mo></mrow> :MATH]
   (
   [MATH: <mrow><mn>28</mn><mo linebreak="goodbreak"
   linebreakstyle="after">/</mo><mn>32</mn></mrow> :MATH]
   ) maintained above
   [MATH: <mrow><mn>80</mn><mo>%</mo></mrow> :MATH]
   efficacy. When utilizing cell-type-specific ChIP-seq as the benchmark,
   the performance remains notably stable, with an average performance
   retention of 93%. Furthermore, when assessed against the other
   three ground truths and the training cell count exceeds 50, there is
   only a modest decline in efficacy, averaging
   [MATH: <mrow><mn>88</mn><mo>%</mo></mrow> :MATH]
   performance retention in comparison to the median value derived from
   all cells. Beyond performance evaluation, we also examined HyperG-VAE’s
   scalability with expansive datasets ([108]Figure S3B).

HyperG-VAE reveals the gene regulation patterns of B cell development in bone
marrow

   To evaluate HyperG-VAE’s proficiency in elucidating GRNs and to assess
   the effectiveness of both cell clustering embedding and gene module
   embedding components within HyperG-VAE, we deployed HyperG-VAE on
   scRNA-seq data of B cell development in bone marrow[109]^38 (more
   details of the data can be found in the [110]STAR Methods and
   [111]Table S3), as illustrated in [112]Figure 3. The progression of B
   cell development from hematopoietic stem cells follows a sequential yet
   adaptable developmental pathway governed by interactions among
   environmental stimuli, signaling cascades, and transcriptional
   networks.[113]^39 Throughout this developmental trajectory, TFs play a
   pivotal role in regulating the cell cycle, differentiation, and
   advancement to subsequent developmental stages. These critical
   checkpoints encompass the initial commitment to lymphocytic
   progenitors, the specification of pre-B cells, progression through
   immature stages, entry into the peripheral B cell pool, B cell
   maturation, and subsequent differentiation into plasma cells.[114]^40
   Each of these regulatory nodes is controlled by complex transcriptional
   networks, which, along with sensing and signaling systems, determine
   the final outcomes.

Figure 3.

   [115]Figure 3
   [116]Open in a new tab

   GRN prediction by HyperG-VAE across developmental B cell states in bone
   marrow

   (A) t-distributed stochastic neighbor embedding (t-SNE) visualization
   of cell embedding on the bone marrow B cell dataset; the embedding is
   learned by the cell encoder of HyperG-VAE. Black lines depict the
   trajectory from pre-pro-B cells to mature B cells.

   (B) Heatmap/dot plot showing TF expression of the regulon on a color
   scale and cell-type specificity (RSS [regulon specificity score]) of
   the regulon on a size scale.

   (C) The accuracy of GRN prediction by cross-validation with publicly
   available ChIP-seq datasets. The overlap coefficient quantifies the
   concordance between sets of target genes for each TF, as derived from
   GRN prediction and ChIP-seq database, respectively. The x axis
   represents the difference value of overlap coefficients between
   HyperG-VAE and SCENIC (default). Pink lines indicate superior
   performance by HyperG-VAE, while blue lines favor the default SCENIC.
   The dot plot illustrates the overlap coefficient of the more effective
   approach for each regulon, depicted on a color gradient. Cell states
   are arranged in a sequence that reflects the progression of bone marrow
   B cell development stages.

   (D) The GRN visualization for the bone marrow B cell dataset with ten
   states from pre-pro-B state to plasma state, as delineated by
   HyperG-VAE; the inner circle shows the co-binding of shared target
   genes, while the outer circle presents TF-focused target genes.

   HyperG-VAE uncovers the cell embedding by dimensionality reduction and
   distinctly segregates the primary cell types across various stages of
   bone marrow B cell development ([117]Figure 3A). Significantly,
   HyperG-VAE also effectively captures the linear progression of B cell
   development, spanning from early pro-B, late pro-B, large pre-B, small
   pre-B, and immature B to mature B cells. In our pursuit to unveil the
   gene regulation patterns in developmental B cells, our HyperG-VAE, in
   conjunction with SCENIC,[118]^41 successfully identifies established
   master regulators associated with different developmental stages
   ([119]Figures 3B and [120]S4), including pre-pro-B (Runx2), pro-B (Ebf1
   and Lef1), large pre-B (Myc and Hmgb2), small pre-B (Tcf3 and Sox4),
   immature B (Relb and Egr1), mature B (Nfkb2), and plasma (Cebpb and
   Prdm1) cells.

   Furthermore, we conducted a benchmark assessment to compare the
   performance of HyperG-VAE against SCENIC using its default settings.
   Using the ChIP-seq database,[121]^33 the accuracy was evaluated based
   on the degree of overlap coefficient between the ChIP-seq coverage and
   the predicted target genes from both methods. Our HyperG-VAE, when
   combined with SCENIC, demonstrates superior performance compared to the
   standard SCENIC approach, exhibiting higher accuracy in detecting
   TF-target patterns for the key TFs (as illustrated in [122]Figure 3C).
   The comprehensive gene regulation network spanning the developmental B
   cells in the bone marrow is depicted in [123]Figure 3D. We find that
   the GRNs show TF-target regulation patterns in two ways: TFs co-binding
   to shared predicted enhancers (the inner circle in [124]Figure 3D) and
   TF-specific target genes (the outer circle in [125]Figure 3D). We also
   observe that the cooperativity between TFs is stronger within cell
   types along the development path, indicating that some TFs are involved
   in multiple stages of B cell development.

Gene expression module learning enhances HyperG-VAE in GRN inference

   Our HyperG-VAE model augments the GRN prediction by integrating gene
   space learning, as depicted in [126]Figure 4C. HyperG-VAE uncovers the
   gene expression modules visualized by uniform manifold approximation
   and projection (UMAP)[127]^42 in [128]Figure 4A. By associating these
   gene modules with the key TFs and corresponding target genes of
   pathways along B cell development, we annotate the gene modules with
   specific cell types, indicating that these gene clusters are activated
   in different stages of developmental B cells ([129]Figures 4A and 4B).

Figure 4.

   [130]Figure 4
   [131]Open in a new tab

   The interplay between gene embedding and cell clusters

   (A) Gene embedding by the gene encoder of HyperG-VAE on developmental B
   cell data. Gene clusters encoded by numbers are associated with
   different cell types by colors.

   (B) The heatmap illustrates normalized overlap values between gene
   clusters and TF regulons from different B cell states. Here, genes
   serve as a bridge to compute the overlap, with lighter colors
   representing larger overlap scores.

   (C) t-SNE visualization of cellular embeddings with highlighted
   pre-BCRi B cell state, together with the associated regulon, Phf8_(+),
   and related target genes.

   (D) Pathway enrichment analysis on gene cluster 5 with associated
   molecular complex detection (MCODE) network components.

   (E) Pathway enrichment analysis of different gene clusters. The shaded
   pathways show the dominant gene programs for each gene cluster.

   We further apply gene set enrichment analysis (GSEA)[132]^43 ([133]STAR
   Methods) to investigate the gene clusters ([134]Figures 4E and [135]S1;
   [136]Tables S5 and [137]S6). The pathways identified through GSEA
   validate the accuracy of our gene cluster annotations. For example,
   large pre-B cells (cluster pre-BCRi [B cell receptor independent] B)
   are associated with signals initiating diverse processes, which include
   proliferation and recombination of the light-chain gene[138]^44; the
   GSEA results show the related pathways: lymphocyte proliferation, cell
   activation, and B cell receptor signaling pathway. Immature B cells
   exhibit B cell central tolerance, which is governed by mechanisms such
   as receptor editing and apoptosis.[139]^45 The pathways identified in
   the corresponding gene clusters include antigen processing and
   presentation of exogenous peptide antigen, DNA damage response,
   regulation of cell killing, and apoptotic signaling pathways. Plasma
   cells are terminally differentiated B lymphocytes that secrete
   immunoglobulins, also known as antibodies.[140]^46 Considering the
   substantial demands placed on these cells for secretory biological
   processes, the pathways associated with the relevant gene cluster shed
   light on the cellular response to endoplasmic reticulum stress.

   We show that the gene modules are associated with different biological
   pathways during B cell development in the bone marrow. These gene
   modules implicitly incorporate the gene regulation patterns, leading to
   different cell types. On the other hand, distinct cell types of B cell
   clustering are engaged in various immunological
   environments,[141]^39^,[142]^47 resulting in different signaling
   pathways for B cell activation and fate decisions. We exemplify this
   joint relationship with an example involving B cells at the large pre-B
   stage, as shown in [143]Figures 4C and 4D. This specific cell state
   ([144]Figure 4C) is characterized by gene regulation patterns
   associated with cell proliferation, reflected by the regulon
   Phf8(+).[145]^48 The corresponding gene cluster ([146]Figure 4D) is
   linked to a molecular complex detection (MCODE) network, which belongs
   to the lymphocyte proliferation pathway ([147]Figure 4E) and shares
   target genes with the Phf8(+) regulon.

   Therefore, our HyperG-VAE reciprocally integrates these two concepts,
   cell clustering and gene module detection, with the aim of revealing
   GRNs. Concretely, the cell embedding process groups together similar
   cells that share common pathways, while the gene modules aggregate
   genes exhibiting similar regulation patterns, thereby enhancing the
   accuracy of GRN computations.

HyperG-VAE constructs the cell-type-specific GRN on B cell development in
bone marrow

   We have demonstrated that gene modules associated with various
   biological pathways correspond to distinct cell types within bone
   marrow development in B cells. Essentially, these distinct gene
   regulation patterns influence cell fate commitment, leading to the
   development of diverse cell types with varying gene regulation
   profiles. Thus, we employ HyperG-VAE to investigate each individual
   state of developmental B cells and construct a more accurate GRN for B
   cells at specific developmental stages, as illustrated in
   [148]Figure 5. B cell development in the bone marrow can be broadly
   categorized into four states: pro-B, large pre-B, small pre-B, and
   mature B.[149]^40 These four stages are visualized using UMAP, as
   depicted in [150]Figure 5B. For each of these states, we employed
   HyperG-VAE to compute GRNs and uncover the predominant regulatory
   patterns, as illustrated in [151]Figures 5A–5C. HyperG-VAE effectively
   reveals the key TFs and their associated target genes within each cell
   state. For example, in the pro-B state, Ebf1[152]^49 and Pax5[153]^50
   play significant roles, while Myc[154]^38 stands out in the large pre-B
   state; Bach2[155]^38^,[156]^51 is crucial in the small pre-B state, and
   Klf2[157]^52 and Ctcf[158]^53 are notable in the mature state.

Figure 5.

   [159]Figure 5
   [160]Open in a new tab

   Cell-type-specific GRN analysis of developmental B cells in bone marrow

   (A) The Sankey diagram shows significant regulons and corresponding
   target genes of different states along B cell development in bone
   barrow, with the normalized enrichment score (NES) encoded by color
   shade and the area under the curve (AUC) score by dot size.

   (B) Gene regulatory hypergraph at the cell clustering level,
   illustrating the four principal B cell states as four hyperedges.
   Conserved TFs are highlighted with red dots, and target genes are
   depicted as diamonds, where size reflects the log fold change (logFC)
   in gene expression of a given state compared to others. Highly
   expressed genes are labeled in the figure.

   (C) The motif of significant TFs along the principal stages. Additional
   motif details can be found in [161]Figure S5.

   (D) Heatmap displays the expression of the top genes, ranked by logFC,
   across cells classified into four distinct cell states. The genes are
   selected by the overlap of top logFC genes and predicted target genes.
   The genes’ color corresponds to the cell states in which the regulation
   pattern is predicted.

   (E) Volcano plot of differentially expressed genes of different states.
   The blue inverted triangles denote downregulated genes, and the red
   triangles denote upregulated genes.

   The aforementioned TFs, along with their respective target genes,
   collectively constitute the regulons that characterize the four major
   cell states, allowing for the construction of a gene regulatory
   hypergraph at the cell clustering level ([162]Figures 5A and 5B). For
   each major state, we overlap the top-predicted target genes by
   HyperG-VAE ([163]Figure 5B) with the differentially expressed genes
   (DEGs; [164]Figure 5E) and identify the principal marker genes
   ([165]Figure 5D). Specifically, Ebf1 and Pax5 are essential in the
   pro-B state of bone marrow to maintain an early B cell phenotype
   characterized by the expression of B cell-specific genes such as Vpreb
   and Igll1 for surrogate light-chain production.[166]^49^,[167]^50 In
   the large pre-B stage, the enriched regulons encompass the TF
   Myc[168]^38 and other genes related to the cell cycle, such as Mki67,
   Cenpf, Cenpa, and Hmgb2. Additionally, nucleosome-related genes, such
   as Hist1h2ae and Hist1h3c, are also enriched in this state due to the
   high rate of cell proliferation. In the small pre-B stage, both Bach2
   and Btg1 restrain cell proliferation.[169]^54^,[170]^55 It is
   noteworthy that the mature state markers H2-Ab1, H2-Eb1, H2-Aa, and
   Cd74 are assigned as target genes in the pro-B stage, suggesting that
   these genes may be actively repressed in the early B cell development
   stage.

HyperG-VAE addresses cellular heterogeneity and learns the cell
representation

   Cellular heterogeneity is a hallmark of complex biological systems,
   manifesting as diverse cell types and states within scRNA-seq
   datasets.[171]^56 We hypothesize that the latent space inferred by the
   cell encoder of HyperG-VAE captures this biological variability among
   cells. Leveraging domain expertise, we can map these clusters to known
   cell types or states, ensuring that the computational predictions align
   with manual inspection and annotation. To evaluate the performance, we
   applied HyperG-VAE to three biologically relevant scRNA-seq datasets,
   including an Alzheimer’s disease (AD) dataset,[172]^57 a colorectal
   cancer dataset,[173]^58 and the widely used mouse brain dataset, known
   as the Zeisel dataset[174]^59 (more dataset details can be found in
   [175]Table S4). To benchmark HyperG-VAE, we also compared its
   low-dimensional embeddings with those of six other algorithms:
   autoCell,[176]^60 DCA,[177]^61 scVI,[178]^62 DESC,[179]^63
   SAUCIE,[180]^64 and scVAE.[181]^65 We followed the Louvain
   algorithm[182]^66 to cluster all the single cells into an identical
   number of clusters for each method ([183]STAR Methods). To assess the
   precision of clustering against established reference labels, we
   employed four metrics: the adjusted Rand index (ARI), normalized mutual
   information (NMI), homogeneity (HOM), and completeness (COM). These
   metrics span a scale from 0, indicating random clustering, to 1,
   signifying perfect alignment with reference clusters, with superior
   values indicating enhanced accuracy.

   Overall, the performance of HyperG-VAE surpasses that of its
   counterparts, as evidenced in [184]Figure 6A. Specifically, for the
   Zeisel dataset, the clusters generated using HyperG-VAE align more
   closely with the existing cell type annotations, registering an NMI of
   [MATH: <mrow><mn>83.1</mn><mo>%</mo></mrow> :MATH]
   and an ARI of
   [MATH: <mrow><mn>83.7</mn><mo>%</mo></mrow> :MATH]
   . In comparison, the next best-performing algorithm, autoCell, recorded
   an NMI of
   [MATH: <mrow><mn>78.0</mn><mo>%</mo></mrow> :MATH]
   and an ARI of
   [MATH: <mrow><mn>80.6</mn><mo>%</mo></mrow> :MATH]
   . Furthermore, we evaluated HyperG-VAE’s latent space to determine its
   ability to capture the biological diversity among individual cells in
   the Zeisel dataset, as illustrated in [185]Figure 6B. We visualized the
   data embedding by UMAP. In previous sections, we showed that HyperG-VAE
   effectively identifies gene expression differences within cells of the
   same type. The compact UMAP highlights distinct clusters while
   preserving intra-cluster heterogeneity, demonstrating its ability to
   capture both inter- and intra-cluster variability. Compared to other
   algorithms, the distinct separation observed with HyperG-VAE across
   most clusters indicates effective clustering, suggesting that
   HyperG-VAE’s cell encoder adeptly distinguishes between various cell
   states or types. While algorithms such as autoCell, scVI, and scVAE
   have achieved results that are comparable, the differentiation between
   their clusters is not as pronounced as with HyperG-VAE. For the
   remaining algorithms, the substantial overlap among clusters hinders
   the classifier from producing optimal results. Specifically, compared
   to other methodologies founded on conventional single-layer VAEs, the
   enhanced visualization capabilities of HyperG-VAE underscore the
   potential benefits of incorporating gene modules in cell embedding
   processes.

Figure 6.

   [186]Figure 6
   [187]Open in a new tab

   Benchmarks of single-cell clustering and embedding

   (A) The cell clustering performance of HyperG-VAE on the single-cell
   datasets compared with six baseline methods on four key metrics: NMI,
   ARI, COM, and HOM. NMI, normalized mutual information (the higher the
   value, the better); ARI, adjusted rand index (the higher the value, the
   better); COM, completeness (the higher the value, the better); HOM,
   homogeneity (the higher the value, the better).

   (B) UMAP visualization of latent representations on the Zeisel dataset
   for different methods. The black circles highlight areas with ambiguous
   classification boundaries.

Discussion

   In this work, we introduce HyperG-VAE, a sophisticated model designed
   for the construction of GRNs. Uniquely, HyperG-VAE leverages a
   hypergraph framework, wherein genes expressed within individual cells
   are represented as nodes connected by distinct hyperedges, capturing
   the latent gene correlations among single cells. As a key algorithmic
   innovation of HyperG-VAE, the transformation of scRNA-seq data into a
   hypergraph offers unique advantages compared to existing GRN inference
   methods. These advantages include improved modeling of cellular
   heterogeneity, enhanced analysis of gene modules, increased sensitivity
   to gene correlations among cells, and improved visualization and
   interpretation of GRNs. This direct use of a hypergraph, as opposed to
   traditional pairwise methods like star expansion (SE) and clique
   expansion (CE),[188]^67 captures complex multi-dimensional
   relationships more effectively, avoiding the increased complexity and
   information loss associated with SE and CE. By maintaining the
   hypergraph’s original form, HyperG-VAE preserves the data’s full
   complexity and integrity, enhancing analytical depth and reducing
   computational demands.

   In addition to modeling scRNA-seq data in a hypergraph, HyperG-VAE
   effectively integrates gene modules and cellular heterogeneity,
   demonstrating superior performance compared to existing methods. On the
   one hand, our study reveals that HyperG-VAE outperforms related
   existing state-of-the-art algorithms in GRN inference, cell type
   classification, and visualization tasks, respectively, as evidenced by
   its enhanced performance across several widely recognized benchmark
   datasets. On the other hand, we also utilize HyperG-VAE on scRNA-seq
   data of B cell development in bone marrow[189]^38 to evaluate its
   performance in a biologically relevant context. Firstly, HyperG-VAE
   achieves accurate prediction of GRNs and successfully identifies key
   master regulators and target genes across different developmental
   stages. Meanwhile, we cross-validated our results with publicly
   available ChIP-seq datasets,[190]^33 further demonstrating HyperG-VAE’s
   robust performance in predicting regulons based on GRN inference.
   Secondly, subsequent evaluations across various tasks further
   highlighted the effectiveness of HyperG-VAE’s carefully designed
   encoder components, with their synergistic interaction significantly
   bolstering the model’s overall performance. Specifically, the cell
   encoder within HyperG-VAE predicts the GRNs through a structural
   equation model while also pinpointing unique cell clusters and tracing
   the developmental lineage of B cells; the gene encoder uncovers gene
   modules that implicitly encapsulate patterns of gene regulation,
   thereby enhancing the accuracy of GRN predictions. To demonstrate this
   interaction, we highlight the shared genes between gene clusters and
   the predicted target genes within cell clusters. These shared genes are
   notably present in pathways identified by GSEA, signifying the
   connections between gene modules identified by gene encoders and cell
   clusters delineated by cell encoders.

   Our proposed model, HyperG-VAE, holds promise as a foundational
   framework, adaptable to a multitude of biological contexts in future
   research endeavors. A promising future direction is extending
   HyperG-VAE into a heterogeneous hypergraph VAE by incorporating
   additional omics data, such as single-cell ChIP-seq. Integrating
   single-cell ChIP-seq data into HyperG-VAE would enhance GRN
   construction, complementing transcriptomic data and revealing upstream
   regulatory events. This kind of integration would enable seamless
   multi-omics data fusion and further advance GRN inference and cellular
   regulation analysis. Additionally, while the present model does not
   explicitly use metadata for genes and cells, future enhancements that
   integrate these metadata into the hypergraph-centric framework could
   significantly improve the representations of nodes (genes) and
   hyperedges (cells). The weights assigned to these hyperedges can also
   be factored into the model’s learning phase, offering a more
   comprehensive analysis. In the generative phase of HyperG-VAE,
   gene-cell interactions proceed through a cohesive mechanism,
   facilitating the development of a robust GRN underscored by the
   interplay between gene modules and cell clusters. Moreover, advancing
   to a single-cell-level, fine-grained gene-coexpression hypergraph study
   could further enhance our understanding of single-cell dataset
   analysis. Furthermore, subsequent research could explore the dynamic
   construction of temporal GRNs on chronological single-cell data,
   drawing upon the foundational principle of simultaneously considering
   cellular heterogeneity and gene modules, as demonstrated in this work.

   Overall, HyperG-VAE provides a competitive solution for GRN
   construction and related downstream works. By combining cell-specific
   GRN inference, hypergraph-based gene module identification, and
   integrated cell-gene latent representations, HyperG-VAE delivers
   biologically relevant insights that extend beyond traditional GRN
   methods. It provides researchers with powerful tools to explore
   cell-specific regulatory mechanisms, identify gene modules, and
   generate testable predictions, thereby advancing our understanding of
   complex biological systems.

Limitations of the study

   Our HyperG-VAE leveraging the self-attention mechanism has undeniably
   propelled models to achieve remarkable
   performance.[191]^68^,[192]^69^,[193]^70^,[194]^71 However, despite its
   prowess, self-attention-based models still have inherent limitations.
   Specifically, the self-attention’s quadratic complexity concerning
   sequence length presents challenges. For sequences of length N, it
   necessitates
   [MATH: <mrow><mi mathvariant="script">O</mi><mrow><mo
   stretchy="true">(</mo><msup><mi>N</mi><mn>2</mn></msup><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   computations, rendering it computationally demanding and memory
   inefficient, especially for longer sequences. Future efforts to address
   this limitation will explore adapting the techniques of attention
   matrix sparse factorization and positive orthogonal random features, as
   demonstrated in studies,[195]^72^,[196]^73 to ease computational
   demands.

Resource availability

Lead contact

   Requests for resources of this article and any additional information
   should be directed to the lead contact, Wenjie Zhang
   (wenjie.zhang@unsw.edu.au).

Materials availability

   This study did not generate new unique reagents.

Data and code availability

     * •
       All datasets used in the work have been summarized in the [197]key
       resources table.
     * •
       All original code has been deposited at
       [198]https://github.com/guangxinsuuu/HyperG-VAE and is publicly
       available at [199]https://doi.org/10.5281/zenodo.15028720 as of the
       date of publication.
     * •
       Any additional information required to reanalyze the data reported
       in this paper is available from the [200]lead contact upon request.

Acknowledgments