Graphical abstract graphic file with name fx1.jpg [37]Open in a new tab Highlights * • Hypergraph modeling enhances scRNA-seq representation by reducing sparsity * • Capturing cellular heterogeneity and gene modules uncovers complex GRN relationships * • HyperG-VAE surpasses benchmarks in predicting GRNs and identifying key regulators * • Tested on B cells, HyperG-VAE excels in gene regulation, clustering, and lineage tracing Motivation Gene regulatory networks (GRNs) derived from single-cell RNA sequencing (scRNA-seq) data provide insights into the complex interactions between transcription factors (TFs) and target genes. This capability enables a detailed understanding of gene expression regulation and cellular function across diverse populations. However, addressing both cellular heterogeneity and gene modules remains a significant challenge, as current GRN inference methods struggle to bridge this gap. To address this limitation, we introduce hypergraph variational autoencoder (HyperG-VAE), an algorithm based on hypergraph representation learning that captures latent correlations among genes and cells, enhancing the imputation of contact maps. __________________________________________________________________ Su et al. present HyperG-VAE, a hypergraph-based model integrating cellular heterogeneity and gene modules for robust GRN inference from scRNA-seq data. By capturing latent gene-cell correlations, it overcomes sparsity, outperforms existing methods, and reveals key gene regulation patterns with robust downstream analysis. Introduction Gene regulatory networks (GRNs) within single-cell RNA sequencing (scRNA-seq) datasets present a sophisticated interplay of transcription factors (TFs) and target genes, uniquely capturing the modulation of gene expression and thereby delineating the intricate cellular functions and responses within diverse cell populations.[38]^1 GRNs illuminate core biological processes and underpin applications from disease modeling to therapeutic design,[39]^2^,[40]^3^,[41]^4 empowering researchers to interpret the mechanisms of gene interactions within cells and leverage this understanding for medical and biotechnological innovations.[42]^5^,[43]^6 Numerous methodologies have emerged for inferring GRNs from single-cell transcriptomic data. The algorithms emphasize co-expression networks in a statistical way (e.g., PPCOR[44]^7 and LocaTE[45]^8) or aim to decipher causal relationships between TFs and their target genes based on the analysis of the gene interactions among cells (e.g., DeepSEM[46]^9 and PIDC[47]^10). Despite their achievements, these algorithms still have inherent limitations. Specifically, these approaches mainly focus on cellular heterogeneity and overlook the critical importance of simultaneously considering cellular heterogeneity and gene module information in the model design. Generally, from the view of underlying principles, we can divide the methodologies into deep learning methods and traditional statistical algorithms. Many deep learning-based (e.g., DeepTFni[48]^11 and DeepSEM[49]^9) methodologies primarily build upon foundational models.[50]^12^,[51]^13 The frequent oversight in these models is the inherent relationships between cells and genes, as informed by domain expertise. This often leads to models that compromise on explainability and narrow their application scope. For the traditional statistical algorithms, such as Bayesian networks[52]^14^,[53]^15 and ensemble methods,[54]^16^,[55]^17^,[56]^18 it can be computationally expensive, and it remains a challenge to extend these methodologies to encompass broader nonlinear paradigms. Additionally, the scRNA-seq data are frequently marred by noise and incompleteness, attributable to phenomena such as amplification biases inherent to reverse transcription and PCR amplification processes,[57]^19^,[58]^20 as well as the issue of low quantities of nucleic acids in single cells. To get a more robust GRN, several methodologies[59]^21^,[60]^22 leverage multi-omic datasets, capturing different kinds of cellular information to enrich the model’s comprehensiveness. However, integrating multi-omic datasets presents substantial challenges, particularly regarding harmonizing data from disparate sources and platforms, and could also introduce additional noise.[61]^23 To address the problems and construct a reliable GRN, we model scRNA-seq data as a hypergraph and present hypergraph variational autoencoder (HyperG-VAE), a Bayesian deep generative model to process the hypergraph data. Distinct from current approaches, HyperG-VAE simultaneously captures cellular heterogeneity and gene modules (in GRN analysis, gene modules refer to clusters of genes that are regulated together by the same set of TFs) through its cell and gene encoders individually during the GRN construction. Two encoders employ variational inference to learn stochastic representations of genes and cells, offering a more flexible and robust approach to managing real-world data complexities. This could be particularly effective in handling noise in scRNA-seq datasets, a capability that has been demonstrated in previous studies.[62]^24^,[63]^25^,[64]^26 Within a shared embedding space, the dual encoders of our model interact, boosting its cohesiveness. The joint optimization manner elucidates gene regulatory mechanisms within gene modules across various cell clusters, thereby augmenting the model’s ability to delineate complex gene regulatory interactions and significantly improving its explainability. Our study evaluates the performance of HyperG-VAE in various scRNA-seq applications. These include (1) GRN inference, (2) cell embedding, (3) gene embedding, and (4) gene regulation hypergraph construction. Through benchmark comparisons, encompassing tasks like GRN inference, data visualization, and single-cell clustering, we establish that HyperG-VAE outperforms existing state-of-the-art methods. Additionally, HyperG-VAE demonstrates its utility in elucidating the regulatory patterns governing B cell development in bone marrow. Our model also excels in learning gene expression modules and cell clusters, which connect the gene encoder and cell encoder individually to boost gene regulatory hypergraph prediction. This integrated functionality of HyperG-VAE improves our comprehension of single-cell transcriptomic data, ultimately providing better insights into the realm of GRN inference. Results Framework overview We introduce HyperG-VAE, a Bayesian deep generative model specifically designed to address the complex challenge of gene regulation network inference using scRNA-seq data, which are represented as a hypergraph ([65]Figure 1; [66]STAR Methods). Our HyperG-VAE takes into account the interplay between gene modules and cellular heterogeneity, allowing for a more accurate representation of cell-specific regulatory mechanisms. This interplay could be incorporated into a hypergraph to capture the nuanced interactions of genes across diverse cellular states. Figure 1. [67]Figure 1 [68]Open in a new tab Overview of HyperG-VAE (A) HyperG-VAE, which takes the expression value matrix derived from scRNA-seq data as input. In the provided table, four cells exhibit expression across fifteen genes, with color gradients indicating varying gene expression levels (white circles mean no expression). (B) The colored circles with serial numbers denote distinct genes, expressed within specific cells, functioning as interlinked nodes. These nodes are interconnected by a singular hyperedge (small dashed ellipses) symbolizing the cell (triangle). Together, these nodes and hyperedges form a hypergraph structure. Node coloration reflects a composite of gene expression levels of the given gene across cells; for instance, gene 3 manifests a blend of green and blue hues. The largest dashed ellipse is the genome shared by all cells. (C) The neural network architecture of HyperG-VAE, where two encoders are designed to process the provided input matrix. The cell encoder uses the structural equation model (SEM) to discern cellular heterogeneity and form the GRN, while the gene encoder, employing a hypergraph self-attention mechanism, focuses on gene module analysis. The decoder subsequently reconstructs the input matrix, leveraging the shared latent space of both gene and cell embeddings. The inferred gene regulation hypergraph integrates cellular and gene representations, drawing on relationships derived from the learned GRN. (D–G) Downstream tasks that can be pursued by HyperG-VAE include GRN construction, clustering both cells and genes, and modeling the interplay between gene modules and cellular heterogeneity. Further details are provided in the legend, located in the top right corner. In the context of hypergraphs, we construct the hypergraph by representing cells as individual hyperedges, with the genes expressed in each cell serving as the nodes included within those hyperedges ([69]Figures 1A and 1B). Specifically, let [MATH: HVRm×n< /mrow> :MATH] denote the scRNA-seq expression matrix, where m is the number of cells and n is the number of genes. The incidence matrix [MATH: M{0 ,1}m×n :MATH] encodes the hypergraph structure: if a gene (node) i is expressed in cell (hyperedge) j ( [MATH: HijV>0 :MATH] ), then [MATH: Mij=1 :MATH] . This construction captures the relationship between cells and their expressed genes, enabling the sparse scRNA-seq data to be effectively represented as a hypergraph. HyperG-VAE incorporates two encoders, a cell encoder and a gene encoder, enabling it to learn the hypergraph representation [MATH: HV :MATH] ([70]Figure 1C) with structure [MATH: M :MATH] . The cell encoder generates cell representations [MATH: HE :MATH] in the form of hypergraph duality, facilitating the embedding of high-order relations via structural equation layers. GRN construction ([71]Figure 1D) is realized in this structural equation layer through a learnable causal interaction matrix. In addition, the cell encoder can adeptly capture the gene regulation process in a cell-specific manner, elucidating a clearer landscape of cellular heterogeneity ([72]Figure 1E). The gene encoder is specifically designed to process observed gene representations, denoted as [MATH: HV :MATH] . Given that genes within a module generally manifest consistent expression profiles across cells, we employ a multi-head self-attention mechanism that is specifically designed for the hypergraph in this work. This not only discerns varying gene expression levels but also assigns appropriate weights to the genes expressed in the same cell during the message-passing phase. Thus, the gene encoder enhances the model’s ability to understand and integrate the intricate interdependencies among genes, thereby aiding in the effective embedding of gene clusters ([73]Figure 1F). Finally, a hypergraph decoder is utilized to reconstruct the original topology of the hypergraph ([74]Figure 1G) using the learned latent embedding of genes and cells. Utilizing the reconstructed hypergraph and the learned inter-gene relationships, we can also infer a gene regulatory hypergraph ([75]Figure 1G). This hypergraph encompasses gene regulatory modules that span across various cell stages. HyperG-VAE enhances GRN inference by incorporating the above two encoders to mutually augment each other’s embedding quality ([76]Figure 1C) while preserving the high-order gene relations among cells, constrained by hypergraph variational evidence lower bound ([77]STAR Methods). Specifically, the cell encoder incorporates a structure equation model (SEM) on gene co-expression space to infer the GRNs; the learning of gene modules by the gene encoder aids in the inference of GRNs since the gene module conceivably incorporates TF-target regulation patterns. By integrating the embedding of genes and cells through joint learning, we observe the substantial performance of downstream tasks ([78]Figures 1D–1G), including the inference of GRNs, cell clustering, gene clustering, and interplay characterization between gene modules and cellular heterogeneity, among others. HyperG-VAE achieves accurate prediction of GRNs We evaluate the performance on GRN inference of HyperG-VAE based on the setting of the BEELINE framework.[79]^27 Our evaluation encompassed seven scRNA-seq datasets. This includes two cell lines from humans and five mouse cell lines (more details can be found in the [80]supplemental information). We evaluate GRN performance using two metrics: EPR, which assesses the enrichment of true positives among the top K predicted edges relative to random predictions, and AUPRC, which measures the area under the precision-recall curve to account for class imbalance. These metrics are applied across four types of ground-truth datasets: STRING,[81]^28 non-specific chromatin immunoprecipitation (ChIP)-seq,[82]^29^,[83]^30^,[84]^31 cell-type-specific ChIP-seq,[85]^32^,[86]^33^,[87]^34 and loss-/gain-of-function (LOF/GOF) networks.[88]^34 As recommended by Pratapa et al.,[89]^27 our analysis for each dataset prioritized the most variable TFs and the top N most-varying genes, where N is set to 500 and [MATH: 1,000 :MATH] . We selected seven state-of-the-art baseline algorithms based on the evaluation of BEELINE to compare with HyperG-VAE: DeepSEM,[90]^9 GENIE3,[91]^17 PIDC,[92]^10 GRNBoost2,[93]^18 SCODE,[94]^35 ppcor,[95]^7 and SINCERITIES.[96]^36 Overall, HyperG-VAE demonstrates a discernible enhancement in performance when compared with other baseline methods in terms of both AUPRC and EPR metrics ([97]Figures 2 and [98]S2). For scaled results of datasets composed of all significantly varying TFs and the 500 most-varying genes (as shown in [99]Figure 2), HyperG-VAE surpasses the seven other benchmarked methods in 40 of the 44 ( [MATH: 91% :MATH] ) evaluated conditions. Compared with the second-best method (DeepSEM), HyperG-VAE enhances the results by at least [MATH: 10% :MATH] in 19 out of the 44 benchmarks. Furthermore, in comparison to other commendable approaches, such as PIDC and GENIE3, our approach registered significant enhancements. For PIDC, 38 out of 44 instances showed improvements of over [MATH: 10% :MATH] , with 27 surpassing [MATH: 30% :MATH] and 22 going beyond [MATH: 50% :MATH] . Similarly, with GENIE3, 33 out of 44 instances marked at least a [MATH: 10% :MATH] enhancement, 26 surpassed [MATH: 30% :MATH] , and an impressive 20 recorded at least a [MATH: 50% :MATH] increase. For results of datasets composed of all significantly varying TFs and the 1000 most-varying genes ([100]Figure S2), HyperG-VAE achieves the best prediction performance on [MATH: 84% :MATH] ( [MATH: 37/44 :MATH] ) of the benchmarks. In comparison to the runner-up method, DeepSEM, HyperG-VAE outperforms by a margin of at least [MATH: 10% :MATH] in 17 of the 44 evaluated benchmarks. Notably, the average enhancement in EPR stands at [MATH: 11.35% :MATH] , while that in AUPRC is [MATH: 7.16% :MATH] . Figure 2. [101]Figure 2 [102]Open in a new tab Benchmarks of different GRN inference methods on experimental scRNA-seq datasets by EPR and AUPRC scores The performance of HyperG-VAE is contrasted against seven alternative algorithms across seven datasets. Each dataset comprises all significantly varying transcription factors (TFs) and the 500 most-varying genes. These evaluations are based on four distinct ground-truth benchmarks: non-specific ChIP-seq, STRING, cell-type-specific ChIP-seq, and LOF/GOF. For each figure pair, the left image depicts the median EPR results, while the right image shows the median AUPRC outcomes. Results inferior to random predictions are excluded from the visualizations for clarity. The color scale in each dataset is normalized between 0 and 1 using a min-max scaling approach. EPR is defined as the odds ratio of true positives among the top K predicted edges, where K represents the number of edges in the ground-truth GRN, compared to random predictions. Similarly, the AUPRC ratio reflects the odds ratio of the AUPRC value between the model and random predictions. With single-cell sequencing data, robustly inferring GRNs from limited cells is pivotal, especially for capturing rare cellular phenotypes and transient states.[103]^9^,[104]^11 Here, we explore the fluctuations in EPR performance and the robustness of HyperG-VAE when confronted with limited training data ([105]Figure S3A). We constructed mouse embryonic stem cell (mESC) datasets[106]^37 composed of all significantly varying TFs and the 500 and 1,000 most-varying genes, respectively, and evaluated the accuracy based on four unique ground-truth benchmarks by randomly subsampling single cells following the BEELINE benchmark.[107]^27 Upon adjusting the number of subsampled single cells to 400, 300, 200, 100, and 50, we registered average performance retentions of [MATH: 94% :MATH] , [MATH: 92% :MATH] , [MATH: 91% :MATH] , [MATH: 80% :MATH] , and [MATH: 53% :MATH] , respectively. Remarkably, when training with cell counts exceeding 100, a robust [MATH: 79.17% :MATH] ( [MATH: 19/24 :MATH] ) retained more than [MATH: 90% :MATH] of their performance, and for counts greater than 50, a compelling [MATH: 87.50% :MATH] ( [MATH: 28/32 :MATH] ) maintained above [MATH: 80% :MATH] efficacy. When utilizing cell-type-specific ChIP-seq as the benchmark, the performance remains notably stable, with an average performance retention of 93%. Furthermore, when assessed against the other three ground truths and the training cell count exceeds 50, there is only a modest decline in efficacy, averaging [MATH: 88% :MATH] performance retention in comparison to the median value derived from all cells. Beyond performance evaluation, we also examined HyperG-VAE’s scalability with expansive datasets ([108]Figure S3B). HyperG-VAE reveals the gene regulation patterns of B cell development in bone marrow To evaluate HyperG-VAE’s proficiency in elucidating GRNs and to assess the effectiveness of both cell clustering embedding and gene module embedding components within HyperG-VAE, we deployed HyperG-VAE on scRNA-seq data of B cell development in bone marrow[109]^38 (more details of the data can be found in the [110]STAR Methods and [111]Table S3), as illustrated in [112]Figure 3. The progression of B cell development from hematopoietic stem cells follows a sequential yet adaptable developmental pathway governed by interactions among environmental stimuli, signaling cascades, and transcriptional networks.[113]^39 Throughout this developmental trajectory, TFs play a pivotal role in regulating the cell cycle, differentiation, and advancement to subsequent developmental stages. These critical checkpoints encompass the initial commitment to lymphocytic progenitors, the specification of pre-B cells, progression through immature stages, entry into the peripheral B cell pool, B cell maturation, and subsequent differentiation into plasma cells.[114]^40 Each of these regulatory nodes is controlled by complex transcriptional networks, which, along with sensing and signaling systems, determine the final outcomes. Figure 3. [115]Figure 3 [116]Open in a new tab GRN prediction by HyperG-VAE across developmental B cell states in bone marrow (A) t-distributed stochastic neighbor embedding (t-SNE) visualization of cell embedding on the bone marrow B cell dataset; the embedding is learned by the cell encoder of HyperG-VAE. Black lines depict the trajectory from pre-pro-B cells to mature B cells. (B) Heatmap/dot plot showing TF expression of the regulon on a color scale and cell-type specificity (RSS [regulon specificity score]) of the regulon on a size scale. (C) The accuracy of GRN prediction by cross-validation with publicly available ChIP-seq datasets. The overlap coefficient quantifies the concordance between sets of target genes for each TF, as derived from GRN prediction and ChIP-seq database, respectively. The x axis represents the difference value of overlap coefficients between HyperG-VAE and SCENIC (default). Pink lines indicate superior performance by HyperG-VAE, while blue lines favor the default SCENIC. The dot plot illustrates the overlap coefficient of the more effective approach for each regulon, depicted on a color gradient. Cell states are arranged in a sequence that reflects the progression of bone marrow B cell development stages. (D) The GRN visualization for the bone marrow B cell dataset with ten states from pre-pro-B state to plasma state, as delineated by HyperG-VAE; the inner circle shows the co-binding of shared target genes, while the outer circle presents TF-focused target genes. HyperG-VAE uncovers the cell embedding by dimensionality reduction and distinctly segregates the primary cell types across various stages of bone marrow B cell development ([117]Figure 3A). Significantly, HyperG-VAE also effectively captures the linear progression of B cell development, spanning from early pro-B, late pro-B, large pre-B, small pre-B, and immature B to mature B cells. In our pursuit to unveil the gene regulation patterns in developmental B cells, our HyperG-VAE, in conjunction with SCENIC,[118]^41 successfully identifies established master regulators associated with different developmental stages ([119]Figures 3B and [120]S4), including pre-pro-B (Runx2), pro-B (Ebf1 and Lef1), large pre-B (Myc and Hmgb2), small pre-B (Tcf3 and Sox4), immature B (Relb and Egr1), mature B (Nfkb2), and plasma (Cebpb and Prdm1) cells. Furthermore, we conducted a benchmark assessment to compare the performance of HyperG-VAE against SCENIC using its default settings. Using the ChIP-seq database,[121]^33 the accuracy was evaluated based on the degree of overlap coefficient between the ChIP-seq coverage and the predicted target genes from both methods. Our HyperG-VAE, when combined with SCENIC, demonstrates superior performance compared to the standard SCENIC approach, exhibiting higher accuracy in detecting TF-target patterns for the key TFs (as illustrated in [122]Figure 3C). The comprehensive gene regulation network spanning the developmental B cells in the bone marrow is depicted in [123]Figure 3D. We find that the GRNs show TF-target regulation patterns in two ways: TFs co-binding to shared predicted enhancers (the inner circle in [124]Figure 3D) and TF-specific target genes (the outer circle in [125]Figure 3D). We also observe that the cooperativity between TFs is stronger within cell types along the development path, indicating that some TFs are involved in multiple stages of B cell development. Gene expression module learning enhances HyperG-VAE in GRN inference Our HyperG-VAE model augments the GRN prediction by integrating gene space learning, as depicted in [126]Figure 4C. HyperG-VAE uncovers the gene expression modules visualized by uniform manifold approximation and projection (UMAP)[127]^42 in [128]Figure 4A. By associating these gene modules with the key TFs and corresponding target genes of pathways along B cell development, we annotate the gene modules with specific cell types, indicating that these gene clusters are activated in different stages of developmental B cells ([129]Figures 4A and 4B). Figure 4. [130]Figure 4 [131]Open in a new tab The interplay between gene embedding and cell clusters (A) Gene embedding by the gene encoder of HyperG-VAE on developmental B cell data. Gene clusters encoded by numbers are associated with different cell types by colors. (B) The heatmap illustrates normalized overlap values between gene clusters and TF regulons from different B cell states. Here, genes serve as a bridge to compute the overlap, with lighter colors representing larger overlap scores. (C) t-SNE visualization of cellular embeddings with highlighted pre-BCRi B cell state, together with the associated regulon, Phf8_(+), and related target genes. (D) Pathway enrichment analysis on gene cluster 5 with associated molecular complex detection (MCODE) network components. (E) Pathway enrichment analysis of different gene clusters. The shaded pathways show the dominant gene programs for each gene cluster. We further apply gene set enrichment analysis (GSEA)[132]^43 ([133]STAR Methods) to investigate the gene clusters ([134]Figures 4E and [135]S1; [136]Tables S5 and [137]S6). The pathways identified through GSEA validate the accuracy of our gene cluster annotations. For example, large pre-B cells (cluster pre-BCRi [B cell receptor independent] B) are associated with signals initiating diverse processes, which include proliferation and recombination of the light-chain gene[138]^44; the GSEA results show the related pathways: lymphocyte proliferation, cell activation, and B cell receptor signaling pathway. Immature B cells exhibit B cell central tolerance, which is governed by mechanisms such as receptor editing and apoptosis.[139]^45 The pathways identified in the corresponding gene clusters include antigen processing and presentation of exogenous peptide antigen, DNA damage response, regulation of cell killing, and apoptotic signaling pathways. Plasma cells are terminally differentiated B lymphocytes that secrete immunoglobulins, also known as antibodies.[140]^46 Considering the substantial demands placed on these cells for secretory biological processes, the pathways associated with the relevant gene cluster shed light on the cellular response to endoplasmic reticulum stress. We show that the gene modules are associated with different biological pathways during B cell development in the bone marrow. These gene modules implicitly incorporate the gene regulation patterns, leading to different cell types. On the other hand, distinct cell types of B cell clustering are engaged in various immunological environments,[141]^39^,[142]^47 resulting in different signaling pathways for B cell activation and fate decisions. We exemplify this joint relationship with an example involving B cells at the large pre-B stage, as shown in [143]Figures 4C and 4D. This specific cell state ([144]Figure 4C) is characterized by gene regulation patterns associated with cell proliferation, reflected by the regulon Phf8(+).[145]^48 The corresponding gene cluster ([146]Figure 4D) is linked to a molecular complex detection (MCODE) network, which belongs to the lymphocyte proliferation pathway ([147]Figure 4E) and shares target genes with the Phf8(+) regulon. Therefore, our HyperG-VAE reciprocally integrates these two concepts, cell clustering and gene module detection, with the aim of revealing GRNs. Concretely, the cell embedding process groups together similar cells that share common pathways, while the gene modules aggregate genes exhibiting similar regulation patterns, thereby enhancing the accuracy of GRN computations. HyperG-VAE constructs the cell-type-specific GRN on B cell development in bone marrow We have demonstrated that gene modules associated with various biological pathways correspond to distinct cell types within bone marrow development in B cells. Essentially, these distinct gene regulation patterns influence cell fate commitment, leading to the development of diverse cell types with varying gene regulation profiles. Thus, we employ HyperG-VAE to investigate each individual state of developmental B cells and construct a more accurate GRN for B cells at specific developmental stages, as illustrated in [148]Figure 5. B cell development in the bone marrow can be broadly categorized into four states: pro-B, large pre-B, small pre-B, and mature B.[149]^40 These four stages are visualized using UMAP, as depicted in [150]Figure 5B. For each of these states, we employed HyperG-VAE to compute GRNs and uncover the predominant regulatory patterns, as illustrated in [151]Figures 5A–5C. HyperG-VAE effectively reveals the key TFs and their associated target genes within each cell state. For example, in the pro-B state, Ebf1[152]^49 and Pax5[153]^50 play significant roles, while Myc[154]^38 stands out in the large pre-B state; Bach2[155]^38^,[156]^51 is crucial in the small pre-B state, and Klf2[157]^52 and Ctcf[158]^53 are notable in the mature state. Figure 5. [159]Figure 5 [160]Open in a new tab Cell-type-specific GRN analysis of developmental B cells in bone marrow (A) The Sankey diagram shows significant regulons and corresponding target genes of different states along B cell development in bone barrow, with the normalized enrichment score (NES) encoded by color shade and the area under the curve (AUC) score by dot size. (B) Gene regulatory hypergraph at the cell clustering level, illustrating the four principal B cell states as four hyperedges. Conserved TFs are highlighted with red dots, and target genes are depicted as diamonds, where size reflects the log fold change (logFC) in gene expression of a given state compared to others. Highly expressed genes are labeled in the figure. (C) The motif of significant TFs along the principal stages. Additional motif details can be found in [161]Figure S5. (D) Heatmap displays the expression of the top genes, ranked by logFC, across cells classified into four distinct cell states. The genes are selected by the overlap of top logFC genes and predicted target genes. The genes’ color corresponds to the cell states in which the regulation pattern is predicted. (E) Volcano plot of differentially expressed genes of different states. The blue inverted triangles denote downregulated genes, and the red triangles denote upregulated genes. The aforementioned TFs, along with their respective target genes, collectively constitute the regulons that characterize the four major cell states, allowing for the construction of a gene regulatory hypergraph at the cell clustering level ([162]Figures 5A and 5B). For each major state, we overlap the top-predicted target genes by HyperG-VAE ([163]Figure 5B) with the differentially expressed genes (DEGs; [164]Figure 5E) and identify the principal marker genes ([165]Figure 5D). Specifically, Ebf1 and Pax5 are essential in the pro-B state of bone marrow to maintain an early B cell phenotype characterized by the expression of B cell-specific genes such as Vpreb and Igll1 for surrogate light-chain production.[166]^49^,[167]^50 In the large pre-B stage, the enriched regulons encompass the TF Myc[168]^38 and other genes related to the cell cycle, such as Mki67, Cenpf, Cenpa, and Hmgb2. Additionally, nucleosome-related genes, such as Hist1h2ae and Hist1h3c, are also enriched in this state due to the high rate of cell proliferation. In the small pre-B stage, both Bach2 and Btg1 restrain cell proliferation.[169]^54^,[170]^55 It is noteworthy that the mature state markers H2-Ab1, H2-Eb1, H2-Aa, and Cd74 are assigned as target genes in the pro-B stage, suggesting that these genes may be actively repressed in the early B cell development stage. HyperG-VAE addresses cellular heterogeneity and learns the cell representation Cellular heterogeneity is a hallmark of complex biological systems, manifesting as diverse cell types and states within scRNA-seq datasets.[171]^56 We hypothesize that the latent space inferred by the cell encoder of HyperG-VAE captures this biological variability among cells. Leveraging domain expertise, we can map these clusters to known cell types or states, ensuring that the computational predictions align with manual inspection and annotation. To evaluate the performance, we applied HyperG-VAE to three biologically relevant scRNA-seq datasets, including an Alzheimer’s disease (AD) dataset,[172]^57 a colorectal cancer dataset,[173]^58 and the widely used mouse brain dataset, known as the Zeisel dataset[174]^59 (more dataset details can be found in [175]Table S4). To benchmark HyperG-VAE, we also compared its low-dimensional embeddings with those of six other algorithms: autoCell,[176]^60 DCA,[177]^61 scVI,[178]^62 DESC,[179]^63 SAUCIE,[180]^64 and scVAE.[181]^65 We followed the Louvain algorithm[182]^66 to cluster all the single cells into an identical number of clusters for each method ([183]STAR Methods). To assess the precision of clustering against established reference labels, we employed four metrics: the adjusted Rand index (ARI), normalized mutual information (NMI), homogeneity (HOM), and completeness (COM). These metrics span a scale from 0, indicating random clustering, to 1, signifying perfect alignment with reference clusters, with superior values indicating enhanced accuracy. Overall, the performance of HyperG-VAE surpasses that of its counterparts, as evidenced in [184]Figure 6A. Specifically, for the Zeisel dataset, the clusters generated using HyperG-VAE align more closely with the existing cell type annotations, registering an NMI of [MATH: 83.1% :MATH] and an ARI of [MATH: 83.7% :MATH] . In comparison, the next best-performing algorithm, autoCell, recorded an NMI of [MATH: 78.0% :MATH] and an ARI of [MATH: 80.6% :MATH] . Furthermore, we evaluated HyperG-VAE’s latent space to determine its ability to capture the biological diversity among individual cells in the Zeisel dataset, as illustrated in [185]Figure 6B. We visualized the data embedding by UMAP. In previous sections, we showed that HyperG-VAE effectively identifies gene expression differences within cells of the same type. The compact UMAP highlights distinct clusters while preserving intra-cluster heterogeneity, demonstrating its ability to capture both inter- and intra-cluster variability. Compared to other algorithms, the distinct separation observed with HyperG-VAE across most clusters indicates effective clustering, suggesting that HyperG-VAE’s cell encoder adeptly distinguishes between various cell states or types. While algorithms such as autoCell, scVI, and scVAE have achieved results that are comparable, the differentiation between their clusters is not as pronounced as with HyperG-VAE. For the remaining algorithms, the substantial overlap among clusters hinders the classifier from producing optimal results. Specifically, compared to other methodologies founded on conventional single-layer VAEs, the enhanced visualization capabilities of HyperG-VAE underscore the potential benefits of incorporating gene modules in cell embedding processes. Figure 6. [186]Figure 6 [187]Open in a new tab Benchmarks of single-cell clustering and embedding (A) The cell clustering performance of HyperG-VAE on the single-cell datasets compared with six baseline methods on four key metrics: NMI, ARI, COM, and HOM. NMI, normalized mutual information (the higher the value, the better); ARI, adjusted rand index (the higher the value, the better); COM, completeness (the higher the value, the better); HOM, homogeneity (the higher the value, the better). (B) UMAP visualization of latent representations on the Zeisel dataset for different methods. The black circles highlight areas with ambiguous classification boundaries. Discussion In this work, we introduce HyperG-VAE, a sophisticated model designed for the construction of GRNs. Uniquely, HyperG-VAE leverages a hypergraph framework, wherein genes expressed within individual cells are represented as nodes connected by distinct hyperedges, capturing the latent gene correlations among single cells. As a key algorithmic innovation of HyperG-VAE, the transformation of scRNA-seq data into a hypergraph offers unique advantages compared to existing GRN inference methods. These advantages include improved modeling of cellular heterogeneity, enhanced analysis of gene modules, increased sensitivity to gene correlations among cells, and improved visualization and interpretation of GRNs. This direct use of a hypergraph, as opposed to traditional pairwise methods like star expansion (SE) and clique expansion (CE),[188]^67 captures complex multi-dimensional relationships more effectively, avoiding the increased complexity and information loss associated with SE and CE. By maintaining the hypergraph’s original form, HyperG-VAE preserves the data’s full complexity and integrity, enhancing analytical depth and reducing computational demands. In addition to modeling scRNA-seq data in a hypergraph, HyperG-VAE effectively integrates gene modules and cellular heterogeneity, demonstrating superior performance compared to existing methods. On the one hand, our study reveals that HyperG-VAE outperforms related existing state-of-the-art algorithms in GRN inference, cell type classification, and visualization tasks, respectively, as evidenced by its enhanced performance across several widely recognized benchmark datasets. On the other hand, we also utilize HyperG-VAE on scRNA-seq data of B cell development in bone marrow[189]^38 to evaluate its performance in a biologically relevant context. Firstly, HyperG-VAE achieves accurate prediction of GRNs and successfully identifies key master regulators and target genes across different developmental stages. Meanwhile, we cross-validated our results with publicly available ChIP-seq datasets,[190]^33 further demonstrating HyperG-VAE’s robust performance in predicting regulons based on GRN inference. Secondly, subsequent evaluations across various tasks further highlighted the effectiveness of HyperG-VAE’s carefully designed encoder components, with their synergistic interaction significantly bolstering the model’s overall performance. Specifically, the cell encoder within HyperG-VAE predicts the GRNs through a structural equation model while also pinpointing unique cell clusters and tracing the developmental lineage of B cells; the gene encoder uncovers gene modules that implicitly encapsulate patterns of gene regulation, thereby enhancing the accuracy of GRN predictions. To demonstrate this interaction, we highlight the shared genes between gene clusters and the predicted target genes within cell clusters. These shared genes are notably present in pathways identified by GSEA, signifying the connections between gene modules identified by gene encoders and cell clusters delineated by cell encoders. Our proposed model, HyperG-VAE, holds promise as a foundational framework, adaptable to a multitude of biological contexts in future research endeavors. A promising future direction is extending HyperG-VAE into a heterogeneous hypergraph VAE by incorporating additional omics data, such as single-cell ChIP-seq. Integrating single-cell ChIP-seq data into HyperG-VAE would enhance GRN construction, complementing transcriptomic data and revealing upstream regulatory events. This kind of integration would enable seamless multi-omics data fusion and further advance GRN inference and cellular regulation analysis. Additionally, while the present model does not explicitly use metadata for genes and cells, future enhancements that integrate these metadata into the hypergraph-centric framework could significantly improve the representations of nodes (genes) and hyperedges (cells). The weights assigned to these hyperedges can also be factored into the model’s learning phase, offering a more comprehensive analysis. In the generative phase of HyperG-VAE, gene-cell interactions proceed through a cohesive mechanism, facilitating the development of a robust GRN underscored by the interplay between gene modules and cell clusters. Moreover, advancing to a single-cell-level, fine-grained gene-coexpression hypergraph study could further enhance our understanding of single-cell dataset analysis. Furthermore, subsequent research could explore the dynamic construction of temporal GRNs on chronological single-cell data, drawing upon the foundational principle of simultaneously considering cellular heterogeneity and gene modules, as demonstrated in this work. Overall, HyperG-VAE provides a competitive solution for GRN construction and related downstream works. By combining cell-specific GRN inference, hypergraph-based gene module identification, and integrated cell-gene latent representations, HyperG-VAE delivers biologically relevant insights that extend beyond traditional GRN methods. It provides researchers with powerful tools to explore cell-specific regulatory mechanisms, identify gene modules, and generate testable predictions, thereby advancing our understanding of complex biological systems. Limitations of the study Our HyperG-VAE leveraging the self-attention mechanism has undeniably propelled models to achieve remarkable performance.[191]^68^,[192]^69^,[193]^70^,[194]^71 However, despite its prowess, self-attention-based models still have inherent limitations. Specifically, the self-attention’s quadratic complexity concerning sequence length presents challenges. For sequences of length N, it necessitates [MATH: O(N2) :MATH] computations, rendering it computationally demanding and memory inefficient, especially for longer sequences. Future efforts to address this limitation will explore adapting the techniques of attention matrix sparse factorization and positive orthogonal random features, as demonstrated in studies,[195]^72^,[196]^73 to ease computational demands. Resource availability Lead contact Requests for resources of this article and any additional information should be directed to the lead contact, Wenjie Zhang (wenjie.zhang@unsw.edu.au). Materials availability This study did not generate new unique reagents. Data and code availability * • All datasets used in the work have been summarized in the [197]key resources table. * • All original code has been deposited at [198]https://github.com/guangxinsuuu/HyperG-VAE and is publicly available at [199]https://doi.org/10.5281/zenodo.15028720 as of the date of publication. * • Any additional information required to reanalyze the data reported in this paper is available from the [200]lead contact upon request. Acknowledgments