Abstract

   The interplay between transcription factors (TFs) and regulatory
   elements (REs) drives gene transcription, forming gene regulatory
   networks (GRNs). Advances in single-cell technologies now enable
   simultaneous measurement of RNA expression and chromatin accessibility,
   offering unprecedented opportunities for GRN inference at single-cell
   resolution. However, heterogeneity across omics layers complicates
   regulatory feature extraction. We present scTFBridge, a multi-omics
   deep generative model for GRN inference. scTFBridge disentangles latent
   spaces into shared and specific components across omics layers. By
   integrating TF-motif binding knowledge, scTFBridge aligns shared
   embeddings with specific TF regulatory activities, enhancing biological
   interpretability. Using explainability methods, scTFBridge computes
   regulatory scores for REs and TFs, enabling robust GRN inference. Our
   results further demonstrate that scTFBridge can identify
   cell-type-specific susceptibility genes and distinct regulatory
   programs, providing insights into gene regulation mechanisms at the
   single-cell level.

   Subject terms: Computational models, Gene regulatory networks
     __________________________________________________________________

   Recent work has enabled the simultaneous analysis of RNA levels and
   chromatin accessibility for gene regulatory network (GRN) inference,
   but heterogeneity across omics layers remains a barrier in these
   approaches. Here, the authors present scTFBridge, a deep-generative
   model for GRN inference, whereby latent spaces are disentangled into
   shared and specific components across omics layers.

Introduction

   Transcription of genes is primarily regulated by TFs, proteins that
   bind to specific DNA sequences to trigger diverse gene regulatory
   functions^[34]1. These regulatory elements (REs) are typically located
   in accessible chromatin regions, such as promoters, enhancers, and
   silencers^[35]2,[36]3. The binding of TFs to REs and their spatial
   chromatin interaction with target genes (TGs) ultimately determines
   specific gene transcription processes^[37]4. The complex interplay
   among TFs, REs, and TGs can be represented as interconnected GRNs, the
   foundational elements of gene regulation in single cells^[38]5–[39]7.
   Understanding the regulatory mechanisms within these GRNs is pivotal
   for elucidating how cellular identities are established and maintained.
   Such insights have far-reaching implications, from guiding cell fate
   engineering to informing strategies for disease prevention and
   treatment^[40]8–[41]10.

   Recent advances in single-cell high-throughput sequencing techniques
   have enabled the exploration of gene regulatory processes at
   single-cell resolution^[42]11–[43]13. For example, single-cell RNA
   sequencing (scRNA-seq) facilitates the inference of cell-type-specific
   trans-regulatory interactions through computational tools such as
   DeepSEM^[44]14, PIDC^[45]15, GENIE3^[46]16, SupirFactor^[47]17, and
   SCENIC^[48]18. Similarly, single-cell assay for transposase-accessible
   chromatin using sequencing (scATAC-seq) provides insights into
   chromatin accessibility at specific gene regions (Peaks)^[49]19,
   enabling the identification of TF-RE interactions related to gene
   regulation through methods like DeepTFni^[50]20, and DirectNet^[51]21.
   However, these distinct omics layers are typically measured in separate
   cells, complicating efforts to directly link RNA expression and
   chromatin accessibility within the same cell^[52]22. Unpaired data
   integration methods, such as scJoint^[53]23 and scGLUE^[54]24,
   integrate these datasets but are primarily optimized for tasks like
   cell-type annotation and harmonized data representation. Alternatively,
   cross-modality generation approaches, such as BABEL^[55]25 and
   UnpairReg^[56]26, predict gene expression from scATAC-seq data to
   enable deeper gene regulation analysis. Nevertheless, accurately
   capturing complex regulatory relationships within gene regulatory
   networks (GRNs) at single-cell resolution remains constrained by the
   availability of paired data and data quality.

   The emergence of single-cell multi-omics technologies, such as 10x
   Multiome^[57]27 and SHARE-seq^[58]28, has transformed the field by
   enabling simultaneous profiling of paired RNA-seq and ATAC-seq data
   from the same cell. This advancement facilitates more accurate GRN
   inference and a deeper understanding of gene regulation at the
   single-cell level. However, integrating these high-dimensional and
   heterogeneous multi-omics datasets remains a significant
   challenge^[59]29,[60]30. A key objective is to encode distinct omics
   layers into a unified latent space for a cohesive representation, as
   achieved by methods like MultiVI^[61]31, UnitedNet^[62]32.
   Additionally, specialized approaches, including LINGER^[63]33,
   scMultiomeGRN^[64]34, scMTNI^[65]35, scButterfly^[66]36, and
   SCENIC+^[67]37 have been developed to elucidate gene regulation across
   single-cell RNA-seq and ATAC-seq data. Despite these advances, a
   persistent hurdle in data integration is the modality
   gap^[68]38,[69]39. This gap arises from the intrinsic heterogeneity of
   different omics layers, where each modality captures distinct aspects
   of complex biological processes^[70]40. As a result, modality-specific
   private information can hinder the creation of a shared multi-omics
   latent space and impair accurate data reconstruction from the
   integrated embedding^[71]41,[72]42. Bridging this modality gap is
   essential for advancing multi-omics data integration and enabling more
   precise GRN analyses.

   In this study, we propose scTFBridge, an interpretable deep generative
   model for multi-omics integration and GRN inference at the single-cell
   level. By leveraging mutual information theory^[73]43–[74]45,
   scTFBridge minimizes mutual information and disentangles both RNA-seq
   and ATAC-seq data into modality-shared and modality-private subspaces.
   The modality-shared representations are further aligned through
   contrastive learning to construct a unified latent space. Incorporating
   biological priors of TF-motif bindings, scTFBridge constrains shared
   latent variables to represent specific TF regulatory activities,
   effectively acting as a bridge between RNA-seq and ATAC-seq modalities.
   Using deep learning explainability techniques, we derive regulatory
   scores linking REs and latent TF variables to TG expressions during the
   generation of TG expression. Our results demonstrate that scTFBridge
   effectively disentangles modality-shared and modality-private
   information, yielding biologically interpretable embeddings that
   capture cell-type-specific TF regulatory activities. Benchmarking
   against state-of-the-art methods on single-cell cis- and
   trans-regulation datasets, scTFBridge outperforms most baseline
   approaches in regulatory feature extraction across diverse cell groups.
   In summary, scTFBridge provides an efficient computational framework
   for exploring phenotype-specific gene regulatory mechanisms at
   single-cell resolution. It advances multi-omics data integration and
   enhances our understanding of gene regulation dynamics in complex
   cellular systems.

Results

The framework of scTFBridge

   scTFBridge is a biologically interpretable deep generative model
   designed for integrating single-cell multi-omics data and inferring
   cell-specific GRNs (Fig. [75]1a). It leverages a hybrid variational
   autoencoder^[76]46 (VAE) architecture that decomposes latent variables
   into modality-shared and modality-private components for each omics
   layer (Fig. [77]1b). Through mutual information theory and contrastive
   learning regularizations, scTFBridge effectively disentangles shared
   and private representations while aligning the shared latent space to
   capture common regulatory signals (Fig. [78]1c, Methods). To enhance
   biological interpretability, scTFBridge incorporates prior knowledge by
   constraining the weights of the shared ATAC-seq decoder (Fig. [79]1d).
   The shared latent variables were denoted as specific TF activities that
   contributed to modality reconstruction. Further enrichment analysis
   revealed the differential biological function of private/shared sets in
   a cell-specific manner (Fig. [80]1e). Using model interpretation
   techniques, scTFBridge enables the extraction of RE and TF regulatory
   scores for TGs, enabling the establishment of single-cell-specific GRNs
   (Fig. [81]1f, g).

Fig. 1. Overview of the scTFBridge framework.

   [82]Fig. 1
   [83]Open in a new tab

   a Schematic representation of RNA transcription, highlighting the
   cooperative roles of transcription factors (TFs) and regulatory
   elements (REs) in driving RNA synthesis. b The model architecture of
   scTFBridge features a hybrid multi-modal variational autoencoder (VAE)
   designed for latent space dismantlement and representation of mutual
   information (MI) learning. c Cross-mutual-information module for
   aligning modality-shared latent spaces and reducing MI between
   modality-shared and modality-private embeddings. d Regulatory module in
   the ATAC decoder, where TF–motif binding constrains the connections
   between TF variables and regulatory elements. e Definition of
   private/shared peak/gene sets according to their information flow from
   the input layer to disentangled latent space. Further enrichment
   analysis reveals the differential biological functions between
   private/shared sets in a cell-specific manner. f Cis-regulation
   inference depicted for target gene (TG) expression prediction based on
   ATAC data, where the objective function minimizes the mean squared
   error (MSE) between original and reconstructed gene expression levels.
   g Trans-regulation inference depicted for TG expression prediction
   using latent variables, employing the same computational strategy as in
   cis-regulation inference. Panel a was created with BioRender.com.

   In detail, we developed a Cross Mutual Information (CMI) module and
   contrastive learning (CL) loss to create a structured latent space that
   effectively separates and aligns information from both modalities
   (Fig. [84]1c). The CMI module employs independent neural networks to
   impose an upper bound^[85]47 on the MI (Methods), minimizing MI between
   shared and private embeddings within each modality, as well as between
   private embeddings across modalities. This forces common information
   into shared components. The CL loss pulls shared embeddings from the
   same cell closer together in the latent space, while pushing them away
   from the unrelated cells. This process aligns the two modalities,
   creating a unified space where the shared embeddings for a given cell
   are interchangeable, regardless of modality. Additionally, we employed
   a Product of Experts (PoE) module to integrate and maintain the final
   shared latent embedding. These approaches enable our model to learn a
   robust, shared representation that captures the common biology between
   RNA-seq and ATAC-seq while explicitly filtering out modality-specific
   noise.

   To preserve TF-RE regulatory relationships, we developed a regulatory
   module in scTFBridge that incorporates known TF-motif binding
   affinities to single-cell multi-omics data. In the shared latent
   embedding, latent variables are assigned to specific TFs. Prior
   knowledge of TF-motif binding affinities constrains the neural network
   weights, linking these TF latent variables to reconstructed REs
   (Fig. [86]1d, Methods). Concurrently, the same TF latent variables feed
   into the TG decoder network to capture paired gene expression
   information. The PoE module and the weight regularization ensure that
   the shared embeddings from RNA-seq and ATAC-seq data are aligned and
   interpretable at the latent variable level, enabling direct analysis of
   associations between multi-omics features and corresponding TFs in the
   shared latent space.

   To enhance clarity and precision in describing gene regulatory
   mechanisms, we employed SHAP (Shapley Additive Explanations) to
   quantify the contribution of input REs or shared latent TF variables to
   the TG expression reconstruction^[87]48 (Fig. [88]1f, g). SHAP scores
   were computed based on the mean squared error (MSE) between imputed and
   observed gene expression levels under specific feature perturbations
   (Methods). The biologically constrained model framework allows these
   SHAP scores to serve as regulatory scores for each RE and TF associated
   with TGs. For GRN inference, we derived both cis-regulation (RE-TG) and
   trans-regulation (TF-TG) interactions across diverse cell types using
   these scores.

scTFBridge facilitates cell-type-specific disentangled embedding learning

   We applied scTFBridge to a single-cell multi-omics dataset comprising
   paired RNA-seq and ATAC-seq measurements of bone marrow mononuclear
   cells (BMMCs) from healthy donors across multiple batches^[89]49. The
   learned latent embeddings of RNA-private and ATAC-private networks
   enabled visualization of domain-specific cell representations,
   alongside shared cell embeddings integrating both modalities. Uniform
   manifold approximation and projection (UMAP) analysis of the embeddings
   revealed distinct cellular information within each domain. In the
   RNA-private space, most cell types were poorly separated, with the
   clear exception of erythroblasts and normoblasts (Fig. [90]2a).
   Conversely, the ATAC-private embedding captured greater cellular
   heterogeneity (Supplementary Fig. [91]1). To investigate the limited
   resolution of RNA-private embeddings, we estimated the MI score between
   RNA-private, RNA-shared, and scRNA-seq embeddings generated by
   scVI^[92]50 (Supplementary Data [93]1). The results indicated that
   RNA-private embeddings contained minimal unique information compared to
   RNA-shared embeddings, with substantial overlap between RNA-seq and
   ATAC-seq data. This overlap likely stems from the use of single-nucleus
   RNA sequencing (snRNA-seq), which predominantly captures nuclear
   transcripts. Additionally, we performed Gene Set Enrichment Analysis
   (GSEA)^[94]51 to access functional roles of genes in RNA-private
   embeddings based on their contribution ranks to shared or private
   embeddings (Supplementary Fig. [95]2). The analysis revealed that
   RNA-private genes in erythroblasts and normoblasts were significantly
   enriched in the Heme Metabolism pathway, indicating that their gene
   expression during erythroblast differentiation is regulated
   independently of ATAC-seq signals.

Fig. 2. Interpretable TF-informed embedding learning of scTFBridge in BMMCs.

   [96]Fig. 2
   [97]Open in a new tab

   a UMAP visualization of modality-shared and RNA-private embeddings in
   BMMCs. b Comparison between ground-truth target gene (TG) RNA
   expression and scTFBridge-based reconstruction from ATAC-seq data. c
   Quantification and ratio of annotated promoter peaks across shared and
   private peak sets for different cell types. d Regulatory scores and
   target gene counts for cell-type-specific TF variables identified
   through biological interpretation analysis. e Visualization of
   cell-type-specific gene regulons derived from high-ranked target genes
   (TGs) associated with TF variables. The sample size of the box plot is
   the number of cells for each cell type, ranging from 171 to 2168 cells.
   Note that all boxplots in this study present minima and maxima, the
   smallest and largest value that is not considered an outlier; center,
   median; bounds of box, 25th (Q1) to 75th (Q3) percentile; whiskers, 1.5
   times the (Q3–Q1). Source data are provided as a Source Data file.

   To evaluate the scTFBridge’s capacity to learn cross-modal
   interactions, we performed a TG expression reconstruction experiment.
   In detail, we used the RNA-shared embedding inferred from ATAC-seq
   data, concatenated with a randomly sampled RNA-private embedding for TG
   reconstruction. The model’s performance was then quantified by the
   concordance between the clustering result of generated TG expression
   data and the ground-truth cell annotations (Methods). Compared to
   scButterfly^[98]36, scTFBridge achieved superior TG expression
   reconstruction accuracy and preserved cell-specific information more
   effectively (Fig. [99]2b, Supplementary Fig. [100]3). These results
   demonstrate that paired ATAC-seq data successfully capture key gene
   expression signals. Additionally, ablation experiments revealed that
   both MI and CL modules significantly contributed to RNA-seq data
   reconstruction and information retention (Supplementary Data [101]2).

   We further assessed the contributions of different ATAC-seq data Peak
   features to the ATAC-shared and ATAC-private embeddings based on the
   information flow through different encoders. Peaks were categorized
   into shared or private groups by their relative contributions to shared
   or private embeddings (Methods). Comparing the number and ratio of
   promoter Peaks in each group, our result revealed that the
   modality-shared embedding primarily enriched gene regulatory signals
   (Fig. [102]2c). We then explored the cell-type-specific gene regulatory
   elements through biological interpretation analysis using scTFBridge
   (Methods). Our results identified cell-type specific TFs, such as ESR
   in cDC cells, TCF7 in CD4 naïve T cells, and SOX4 in B cells
   (Fig. [103]2d). Additionally, high-ranked TGs of TF variables revealed
   cell-type-specific gene regulons (Fig. [104]2e). These findings
   demonstrate scTFBridge’s capacity for biologically disentangled
   learning and its effectiveness in elucidating potential cellular gene
   regulatory mechanisms.

scTFBridge reveals heritability enrichment of peaks in rheumatoid arthritis

   We applied scTFBridge to a single-cell multi-omics dataset from
   profiled immune cells from rheumatoid arthritis tissues^[105]52. To
   visualize the latent representations, we used UMAP to project the
   modality-shared, RNA-private, and ATAC-private representations into a
   two-dimensional space (Fig. [106]3a). In the RNA-private embedding,
   immune cells formed close clusters, suggesting that most of the gene
   regulatory information for these cells in the RNA-seq data is captured
   by modality-shared embeddings, as compared to the ATAC-seq data. The MI
   scores between RNA-shared and ATAC-shared representations was higher
   than the MI between RNA-shared and RNA-private or ATAC-shared and
   RNA-private, demonstrating that scTFBridge effectively disentangles
   modality-private and modality-shared components (Fig. [107]3b).

Fig. 3. Identifying cell-specific modality-shared and private regulatory
elements and functions in rheumatoid arthritis.

   [108]Fig. 3
   [109]Open in a new tab

   a UMAP visualization of modality-shared, RNA-private, and ATAC-private
   embedding in the dataset of rheumatoid arthritis. b Mutual information
   (MI) comparison between different representations. R and A denote RNA
   and ATAC modalities, respectively; s and p denote shared and private
   embeddings, respectively. c Gene Ontology (GO) pathway analysis
   comparing shared and private gene sets across different cell types. The
   enrichment statistical significance was assessed using the
   hypergeometric test, with -log[10](p-value) indicated. d Quantification
   and ratio of annotated promoter peaks across shared and private peak
   sets in different cell types. e Analysis of shared/private and
   dynamic/condition-specific (cs)-invariant peaks across cell types.
   Statistical significance was assessed using the one-sided Chi-square
   test, with p-values indicated. f Heritability enrichment analysis of
   shared and private peak sets of single-cells from RA samples. The
   x-axis shows the -log[10] enrichment p-value, and the dot size reflects
   the h^[110]2 effect. Heritability enrichment p-values were calculated
   using Stratified LD Score S-LDSC. A Z-test was performed on the
   regression coefficient (τ) for each annotation to determine if its
   contribution to heritability was significantly different from zero.
   Source data are provided as a Source Data file.

   To uncover deeper biological insights from distinct RNA-seq embeddings,
   we quantified the normalized absolute importance scores for each gene
   across different embeddings and identified the top 100 genes as their
   predominant gene programs (Methods). Gene Ontology (GO) pathway
   enrichment analysis revealed that the shared gene program primarily
   contributed to immune activation and receptor signaling, while the
   private gene program was associated with cell cycle regulation
   (Fig. [111]3c, Supplementary Data [112]3). Similarly, we assessed the
   enrichment of promoter peaks into modality-shared and ATAC-private
   embeddings. The results confirmed that the modality-shared embedding of
   scTFBridge captured a higher number of promoter peaks (Fig. [113]3d),
   highlighting its superior ability to capture gene regulatory signals in
   single-cell Multiome datasets.

   To further evaluate the scTFBridge model’s ability to characterize
   crucial regulatory elements, we compared the shared- and ATAC-private
   predominated REs identified by scTFBridge with the dynamic and
   cell-state-invariant (cs-invariant) peaks identified in Gupta et al.
   ^[114]52. Dynamic peaks were defined as those whose variation could be
   explained by paired RNA expression data, representing stronger gene
   regulatory effects^[115]52. In this comparison, we examined the overlap
   of peaks across two distinct groups (Fig. [116]3e). The results
   demonstrated that shared peaks were significantly enriched in the
   dynamic peak group (Chi-squared test, Methods). Additionally, we
   employed the stratified LD-score regression (S-LDSC)^[117]53 method to
   assess the heritability enrichment of shared and ATAC-private REs
   concerning GWAS-annotated summary statistics for rheumatoid arthritis
   (Methods). The enrichment scores revealed that shared peaks were more
   effective in capturing causal variants associated with the disease
   (Fig. [118]3f, Supplementary Data [119]4). These findings suggest that
   the scTFBridge model provides a more refined characterization of gene
   regulatory signals.

Benchmarking analysis of cell-type-specific cis-regulatory inference

   Accurate GRN inference remains challenging across different cell types
   and functional states due to the limited availability of gold-standard
   datasets and potential experimental biases. To assess the
   cell-type-specific cis-regulatory inference capabilities of scTFBridge,
   we integrated promoter-capture Hi-C (pcHi-C) data from three primary
   blood cell types^[120]54 and expression quantitative trait loci (eQTL)
   data from eight immune cell types^[121]55,[122]56. These external
   validation datasets include SNPs significantly associated with gene
   expression, serving as known cis-regulatory RE-TG pairs for performance
   benchmarking (Supplementary Data [123]5, Methods). We compared
   scTFBridge’s performance in cis-regulatory inference against
   established methods, including PCC, Distance, Random, and
   SCENIC+^[124]37 (Methods).

   Since the distance from transcription start sites (TSS) of TG is a
   critical factor in cis-regulatory inference, we categorized predicted
   RE-TG pairs into six distance groups, ranging from 0–5 kb to
   100–200 kb, for evaluation. Using single-cell eQTL data as the ground
   truth, scTFBridge achieved the highest area under the curve (AUC) and
   area under the precision-recall curve (AUPR) ratios across all distance
   groups in naive CD4 cells (Fig. [125]4a, b, Methods), whereas other
   methods performed comparably to random. Since SCENIC+ only provides
   final RE-TG pairs without relevance scores, we used the F1 score to
   assess its performance (Methods). scTFBridge outperformed SCENIC+ in F1
   scores across all distance groups (Fig. [126]4c). In other cell types,
   due to the limited availability of eQTL pairs, all RE-TG pairs were
   treated as positive labels without further distance-based
   stratification. scTFBridge consistently outperformed baseline methods
   in both AUC and AUPR ratios across all evaluated cell types, achieving
   the highest overall ranking (Fig. [127]4d, e) and significantly higher
   F1 scores compared to SCENIC+ (Fig. [128]4f). Additionally, when using
   pcHi-C data as ground truth, scTFBridge also achieved the highest AUC
   across all distance groups in naive CD4 T cells (Fig. [129]4g).

Fig. 4. Benchmarking analysis of cell-type-specific cis-regulatory inference
in PBMCs.

   [130]Fig. 4
   [131]Open in a new tab

   a, b Benchmark performance (AUC and AUPR ratio) of different methods
   for cis-regulatory inference in naïve CD4+ T cells, evaluated using
   eQTL data as ground truth. c F1 score comparison between scTFBridge and
   SCENIC+ for cis-regulatory inference in naïve CD4+ T cells. Statistical
   significance was assessed using a two-sided paired Student’s t-test,
   with *** indicating a p-value < 0.001. The sample size is the number of
   TSS distance groups with 6. The p-value is 0.00051. d, e Benchmark
   performance (AUC and AUPR ratio) of different methods of cis-regulatory
   inference across other cell types. f F1 score comparison between
   scTFBridge and SCENIC+ in the cis-regulatory inference across other
   cell types. Statistical significance was evaluated with a two-sided
   paired Student’s t-test, where ** indicates a p-value < 0.01. The
   sample size if the number of other cell types with 7. The p-value is
   0.0016. g Benchmark performance (AUC) of different methods in the
   cis-regulatory inference of naïve CD4 + T cells, using the pcHi-C data
   as ground truth. h scTFBridge-identified cis-regulatory interactions of
   AIF1 in CD14 Mono, CD4+ T central memory (TCM) cells, and conventional
   dendritic cells (cDC), supported by eQTL evidence. The sample size of
   box plot is the number of cells for each cell type: cDC (n = 36), CD4
   TCM (n = 223) and CD14 Mono (n = 511). Source data are provided as a
   Source Data file.

   To further demonstrate its utility, we examined cis-regulatory
   interactions identified by scTFBridge that align with known gene
   regulatory mechanisms across various cell types. For instance, AIF1
   gene expression varies among cell types, being highly expressed in CD14
   monocytes and cDCs, but with lower expression in T and B cells. To
   illustrate scTFBridge’s ability to capture cell-type-specific
   cis-regulatory interactions for AIF1, we analyzed a 200 kb region
   upstream and downstream region of its TSS, annotating
   cell-type-specific peak accessibility, eQTL SNP loci, and inferred
   RE-TG pairs predicted by scTFBridge (Fig. [132]4h). Using CD4 TCM cells
   as a reference, scTFBridge successfully identified the majority of
   known cis-regulatory elements in CD14 Mono and cDCs, supported by
   single-cell eQTL data. Additionally, scTFBridge uncovered potential
   regulatory elements that may be specific to these cell types
   (Fig. [133]4h). These results underscore the heterogeneity of
   cis-regulatory mechanisms across cell types and highlight scTFBridge’s
   potential to enhance and refine cell-type-specific GRN inference.

Benchmarking analysis of cell-type-specific trans-regulatory inference

   To evaluate the single-cell trans-regulatory inference capability of
   scTFbridge, we leveraged pre-identified TF-TG regulatory pairs from
   Chromatin Immunoprecipitation sequencing (ChIP-seq) data in the
   Cistrome database as the ground truth^[134]57. Specifically, we
   compiled regulatory data for 17 TFs and their corresponding putative
   target genes across four blood cell types (Supplementary Data [135]6).
   We benchmarked scTFBridge’s performance against baseline methods
   designed for cell-type-specific GRN inference employing single-cell
   multi-omics data, including LINGER^[136]33, scMTNI^[137]35,
   scMultiomeGRN^[138]34, and SCENIC+^[139]37. As a case study, we
   analyzed the trans-regulatory scores of STAT1 across all single cells
   (Fig. [140]5a) and examined its paired trans-regulatory TGs in CD14
   monocytes. scTFBridge outperformed all baseline methods except LINGER,
   achieving an AUC score of 0.693, and an AUPR ratio of 2.221
   (Fig. [141]5b, c). Across the benchmarking dataset, scTFBridge
   consistently surpassed the other baseline methods, yielding higher AUC
   scores and AUPR ratios for all tested TFs (Fig. [142]5d, e).

Fig. 5. Benchmarking analysis of cell-type-specific trans-regulatory
inference in PBMCs.

   [143]Fig. 5
   [144]Open in a new tab

   a UMAP visualization of modality-shared embedding of PBMCs. b, c
   Receiver operating characteristic curve and precision-recall curve of
   trans-regulatory potential inference of STAT1 in CD14 Mono cells. d, e
   Box plots comparing AUC and AUPR ratio scores across various methods
   for identifying putative TF target genes using ChIP-seq data.
   Statistical significance was evaluated with two-sided paired Student’s
   t-test, where *** indicates a p-value < 0.001. The sample size is the
   number of putative TFs with 17. For d, p-values are 7.2e^-3, 4.4e^-5,
   1.2e^-5, and 4.8e^−6 for LINGER, SCENIC+, scMTNI, and scMultiomeGRN.
   For e, p-values are 9.1e^-3, 1.1e^-4, 9e^-6 and 4.4e^−6, respectively.
   f Top TF-regulated gene expression validated by ChIP-seq data across
   different cell types, including ETS1 in CD4+ T cells, STAT1 in CD14+
   monocytes, and PAX5 in naïve B cells. g Violin plots showing cell-level
   trans-regulatory scores of specific TF-TG pairs across different cell
   types. The sample size is the number of cells for each cell type,
   ranging from 25 to 511 cells. h Network representations of inferred
   TF-TG pairs in CD14 Mono and naïve B cells. Node color corresponds to
   gene expression levels, while edge color represents regulatory scores.
   Source data are provided as a Source Data file.

   Unlike LINGER, which achieves high performance through pre-training on
   extensive external bulk-seq datasets and using pre-trained gene
   regulatory scores for GRN inference. scTFBridge operates as a
   self-contained, self-supervised framework requiring no external
   training data. This streamlined and versatile approach makes scTFBridge
   highly adaptable to datasets without high-quality bulk data. We
   conducted a comprehensive performance comparison of specific TFs and
   their top-ranked target genes (Supplementary Data [145]7 and Methods).
   For gene sets within the top 150 ranks, LINGER and scTFBridge showed
   comparable performance. However, for certain TFs (IRF1, SPI1, BATF, and
   RUNX3), scTFBridge exhibited superior performance across most ranges
   (Supplementary Fig. [146]7), demonstrating its robustness and broad
   applicability across diverse TFs.

   Next, we explored trans-regulatory associations in other single-cell
   groups. Based on annotated TF-TG associations, we selected ETS1, STAT1,
   and PAX5 as the dominant regulatory TF in CD4 T cells, monocytes, and B
   cells, respectively, along with their most associated TGs. Comparing TG
   expression levels revealed cell-type-specific patterns of gene
   expression for these TF-TG pairs (Fig. [147]5f). We further evaluated
   the regulatory scores predicted by scTFBridge and found that the model
   effectively captured cell-type-specific trans-regulatory associations
   (Fig. [148]5g). To visualize these relationships, we reconstructed the
   inferred trans-regulatory networks for cell-type-dominant TFs and their
   associated TGs. For instance, we identified the STAT1-ZEB2 regulatory
   link in CD14 Monocytes and BATF-BANK1 regulatory link in B cells
   (Fig. [149]5h). These results underscore scTFBridge’s ability to
   resolve cell-type-specific trans-regulatory activities at single-cell
   resolution.

scTFBridge uncovers cell-type-specific gene regulations in lung cancer
susceptibility

   To explore disease-specific single-cell gene regulation, we applied
   scTFBridge to a single-cell multi-omics dataset derived from lung
   cancer^[150]58. This dataset comprises paired RNA-seq and ATAC-seq data
   collected from tumor-distant normal lung tissues of ever- and
   never-smokers (n = 8 per group). The modality-shared embedding of these
   single cells revealed a distinct separation based on annotated cell
   types from Long et al. (Fig. [151]6a). In contrast, the RNA-private and
   ATAC-private embeddings exhibited less cellular information, with the
   RNA-private embedding showing particularly reduced information content
   (Supplementary Fig. [152]8).

Fig. 6. Cell-type-specific gene regulations in lung cancer susceptibility
uncovered by scTFBridge.

   [153]Fig. 6
   [154]Open in a new tab

   a UMAP visualization of the modality-shared embedding of lung
   single-cell dataset. b Regulation scores of tier-6 linkage variants and
   candidate cis-regulatory elements associated with target genes across
   different cell types. Dot color indicates cell-type categories, and dot
   size represents the regulation scores. c UMAP visualization of NRG1
   gene expression of lung single-cell dataset as in (a). d UMAP
   visualization of the regulation scores for the variant rs4236709-linked
   peak across different cell types as in (a). e Visualization of peak
   accessibility and regulation scores within the NRG1 locus across
   various cell types. Dot color corresponds to the regulatory score of a
   specific peak to NRG1. Source data are provided as a Source Data file.

   Long et al. previously defined high-confidence RE-TG pairs (tier 6)
   that included lung cancer candidate causal variants (CCVs) colocalized
   REs, and their TGs. This group proposed potential gene regulatory
   patterns contributing to lung cancer susceptibility. To validate
   scTFBridge’s ability to uncover cell-type-specific cancer
   susceptibility cis-regulations, we compared the inferred regulatory
   scores of identified RE-TG pairs across predefined calls in cell
   subtypes and other cell types (Fig. [155]6b). The results demonstrated
   that scTFBridge effectively captures higher cell-specific
   cis-regulations for these RE-TG pairs.

   Specifically, we visualized the expression of the target gene NRG1
   across all single cells and evaluated the regulatory score inferred by
   scTFBridge for a CCV-specific peak (chr8: 32552186–32552663)
   (Fig. [156]6c). The findings confirmed that scTFBridge accurately
   reflects cell-type-specific cis-regulation. Furthermore, a comparison
   of the inferred regulatory scores across peaks within the NRG1 gene
   region also indicated that the CCV-colocalized peak exhibited the
   highest regulatory potential (Fig. [157]6d, e). To further investigate
   context-dependent gene regulation, we extended our analysis to
   smoking-responsive CREs, the results confirmed reduced regulatory
   scores for selected RE-TG pairs in cells from the ever-smoker group
   (Supplementary Fig. [158]9). These results underscore the capability of
   the scTFBridge model as a powerful tool for identifying gene regulatory
   events underlying human disease susceptibility.

Discussion

   Inferring dynamic gene regulatory processes at single-cell resolution
   advances our understanding of complex cellular regulation during
   development and disease, while also facilitating the discovery of
   genotype-phenotype associations^[159]14,[160]59. Although multiple
   methods have been developed to infer GRNs using scRNA-seq or scATAC-seq
   datasets^[161]37,[162]60, these approaches are often limited by the
   unpaired nature of the datasets and the high sparsity of single-cell
   data, making it challenging to accurately characterize complex
   cell-type-specific GRNs. The advent of single-cell multi-omics
   techniques allows for the simultaneous profiling of individual cells at
   both the transcriptomic (RNA-seq) and epigenomic (ATAC-seq) levels.
   This comprehensive characterization necessitates the development of
   computational methods tailored for single-cell multi-omics profiling.
   These datasets facilitate learning joint embeddings of single-cell
   features^[163]2, establishing associations between REs and TG
   expression^[164]61, and inferring GRNs^[165]33,[166]37. Additionally,
   advances in biological deep learning for single-cell omics further
   provide more accurate representations of single-cell data, enhancing
   interpretability and offering deeper insights into cellular regulatory
   mechanisms^[167]17,[168]24,[169]62.

   In this study, we introduced scTFBridge, an interpretable deep
   generative model designed to infer GRNs from single-cell multi-omics
   data. Leveraging mutual information theory, scTFBridge effectively
   separates the latent variable space into modality-shared and
   modality-private subspaces, addressing key challenges posed by modality
   gaps and information redundancy in multi-omics
   integration^[170]39,[171]63. Furthermore, scTFBridge incorporates prior
   knowledge of TF-motif binding to regularize the biological learning of
   modality-shared embedding. This approach enables the model to capture
   TF-RE interactions within the shared embeddings, enhancing
   interpretability. Notably, each latent variable in the modality-shared
   embedding corresponds to a specific TF, forming a bridge between
   RNA-seq and ATAC-seq data to support accurate GRN inference. This
   architecture not only improves GRN inference accuracy but also ensures
   that the representations remain biologically interpretable.

   We applied scTFBridge to single-cell multi-omics datasets derived from
   BMMCs^[172]49, rheumatoid arthritis cells^[173]52, PBMCs, and lung
   cells^[174]58 to evaluate its performance across diverse gene
   regulatory tasks. Leveraging a deep-learning model interpretation
   framework, scTFBridge effectively identified cell-type-specific TFs and
   uncovered the contributions of various TFs and REs to TG regulation in
   different cell types. Using promoter-capture Hi-C (pcHi-C), eQTL, and
   ChIP-seq data from blood as ground truth, scTFBridge demonstrated
   superior performance compared to most baseline methods in inferring
   cis- and trans-regulatory interactions in the PBMC dataset.
   Furthermore, analysis of a lung cell multi-omics dataset demonstrated
   scTFBridge’s ability to identify candidate susceptibility genes
   associated with causal variants within cell-specific REs. These results
   highlight scTFBridge’s efficacy in disentangling complex regulatory
   features, thereby advancing single-cell GRN inference.

   Despite its contributions, this study has several limitations. Unlike
   LINGER^[175]33, Enformer^[176]64, and NvwaCE^[177]65 models, which
   leverage large-scale pretrained data, scTFBridge was trained from
   scratch, potentially limiting its ability to uncover established gene
   regulatory mechanisms. The benchmarking analysis primarily used
   blood-derived datasets due to the limited availability of standardized
   ground truth data for other tissues and cell types, reducing the
   method’s generalizability. Furthermore, our analysis concentrated on a
   simplified model of TF and RE binding to disentangle each latent
   variable in modality-shared space, potentially limiting its
   applicability to more complex TF regulatory contexts. Although
   scTFBridge effectively extracts latent variables for RNA-seq and
   ATAC-seq data, it does not account for regulatory mechanisms involving
   microRNAs, non-coding RNAs, or private peaks, necessitating further
   exploration and complementary methodologies^[178]2. While the CMI
   module’s capacity for information disentanglement was evaluated,
   achieving complete and accurate information separation in multi-omics
   data remains a significant challenge. To address this, researchers can
   adapt the modality-reconstruction task^[179]32 to evaluate the
   information constraints within specific latent embeddings and conduct
   latent space independence tests to evaluate the information
   redundancy^[180]66. A promising approach involves using a formal
   information-theoretic decomposition framework to quantify multimodal
   interactions, independent of prior biological understanding^[181]67.
   This method employs mathematical estimators to measure redundancy
   (shared information), uniqueness (private information), and synergy
   (emergent information), offering a rigorous, quantitative way to
   evaluate the separation of shared and private components in multimodal
   data.

   In summary, scTFBridge provides a computational framework for inferring
   both cis- and trans-regulatory GRNs from single-cell multi-omics data.
   The core principle of biologically informed multi-modal disentanglement
   can be adapted to other single-cell multi-omics applications. This work
   emphasizes the importance of developing interpretable, biologically
   grounded multi-modal integration methods that advance our understanding
   of molecular regulatory processes at the single-cell level.

Methods

Framework of scTFBridge

   scTFBridge is a deep learning-based computational framework designed to
   integrate single-cell multi-omics data and infer cell-type-specific
   GRNs. The framework utilizes a hybrid multimodal VAE architecture to
   decompose the data into shared and private latent spaces. The shared
   latent space captures regulatory interactions among TFs, REs, and TGs,
   and the private latent space encodes modality-specific information,
   such as non-TF regulatory factors or other cellular-specific features
   unique to each data modality.

Disentangled Multi-Modal VAE

   For a given cell
   [MATH: <mi>n</mi> :MATH]
   , the observed multi-omics input data is denoted as
   [MATH: <mi mathvariant="bold">x</mi><mo>=</mo><mfenced close="}"
   open="{"><mrow><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>r</mi></mrow></msub><mo>,</mo
   ><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>a</mi></mrow></msub></mrow></
   mfenced> :MATH]
   , where
   [MATH: <msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>r</mi></mrow></msub> :MATH]
   represents RNA-seq data and
   [MATH: <msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>a</mi></mrow></msub> :MATH]
   represent ATAC-seq data. we define a low-dimension latent variable
   [MATH: <mi mathvariant="bold">z</mi><mo>=</mo><mfenced close="}"
   open="{"><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>r</mi></mrow></msub><mo>,</mo
   ><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>a</mi></mrow></msub></mrow></
   mfenced> :MATH]
   to represent the observed latent omics. The latent space embedding
   [MATH: <mi mathvariant="bold">z</mi> :MATH]
   is divided into modality-shared and private representations, factorized
   as
   [MATH: <msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>r</mi></mrow></msub><mo>=</mo
   ><mrow><mo>{</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi><mi>r</mi></mrow></msub
   ><mo>,</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>p</mi><mi>r</mi></mrow></msub
   ></mrow><mo>}</mo></mrow> :MATH]
   ,
   [MATH: <msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>a</mi></mrow></msub><mo>=</mo
   ><mrow><mo>{</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi><mi>a</mi></mrow></msub
   ><mo>,</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>p</mi><mi>a</mi></mrow></msub
   ></mrow><mo>}</mo></mrow> :MATH]
   . For a single modality (e.g., RNA-seq), the goal of the deep
   generative model is to maximize the likelihood of the observed data
   conditioned on the latent variables. This likelihood is expressed as
   Eq. ([182]1):
   [MATH: <mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>θ</mi></mrow></mfenced><mo>=</mo
   ><mo>∫</mo><mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>∣</mo><mi
   mathvariant="bold">z</mi><mo>;</mo><mi>θ</mi></mrow></mfenced><mi>p</mi
   ><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">z</mi></mrow></mfenced><mi>d</mi><mi
   mathvariant="bold">z</mi> :MATH]
   1

   Here,
   [MATH: <mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">z</mi></mrow></mfenced> :MATH]
   is the prior distribution, typically set as a standard Gaussian
   distribution, and
   [MATH: <mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>∣</mo><mi
   mathvariant="bold">z</mi><mo>;</mo><mi>θ</mi></mrow></mfenced> :MATH]
   represents the process of generating the observed data from latent
   variable, and
   [MATH: <mi>θ</mi> :MATH]
   denotes the learnable parameters in the decoder. Direct computation of
   [MATH: <mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>θ</mi></mrow></mfenced> :MATH]
   is challenging, so we introduced a variational posterior
   [MATH: <mi>q</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow></mfenced> :MATH]
   to approximate the true poster distribution
   [MATH: <mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow></mfenced> :MATH]
   , where
   [MATH: <mi>ϕ</mi> :MATH]
   are the parameters of data encoder. Using this approximation, Eq.
   ([183]1) can be rewritten as:
   [MATH: <mi>log</mi><mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>θ</mi></mrow></mfenced><mo>=</mo
   ><mi>log</mi><mo>∫</mo><mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>∣</mo><mi
   mathvariant="bold">z</mi><mo>;</mo><mi>θ</mi></mrow></mfenced><mi>p</mi
   ><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">z</mi></mrow></mfenced><mi>d</mi><mi
   mathvariant="bold">z</mi><mo>=</mo><mi>log</mi><mo>∫</mo><mi>q</mi><mfe
   nced close=")" open="("><mrow><mi
   mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow></mfenced><mfrac><m
   row><mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>∣</mo><mi
   mathvariant="bold">z</mi><mo>;</mo><mi>θ</mi></mrow></mfenced><mi>p</mi
   ><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">z</mi></mrow></mfenced></mrow><mrow><mi>q</mi><mfenc
   ed close=")" open="("><mrow><mi mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow></mfenced></mrow></
   mfrac><mi>d</mi><mi mathvariant="bold">z</mi> :MATH]
   2

   Using Jensen’s inequality, this leads to the evidence lower bound
   (ELBO):
   [MATH: <mi>log</mi><mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>θ</mi></mrow></mfenced><mo>=</mo
   ><msub><mrow><mi>E</mi></mrow><mrow><mi>q</mi><mfenced close=")"
   open="("><mrow><mi mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow></mfenced></mrow></
   msub><mfenced close="]"
   open="["><mrow><mi>log</mi><mfrac><mrow><mi>p</mi><mfenced close=")"
   open="("><mrow><mi mathvariant="bold">z</mi><mo>,</mo><mi
   mathvariant="bold">x</mi></mrow></mfenced></mrow><mrow><mi>q</mi><mfenc
   ed close=")" open="("><mrow><mi mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow></mfenced></mrow></
   mfrac></mrow></mfenced><mo>+</mo><mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mrow><mo>(
   </mo><mrow><mi mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow><mo>)</mo></mrow><m
   o>∣</mo><mo>∣</mo><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mro
   w> :MATH]
   3

   Since
   [MATH: <mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mrow><mo>(
   </mo><mrow><mi mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow><mo>)</mo></mrow><m
   o>∣</mo><mo>∣</mo><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mro
   w><mo>≥</mo><mn>0</mn> :MATH]
   , the ELBO is given by:
   [MATH: <msub><mrow><mi
   mathvariant="normal">ELBO</mi><mo>=</mo><mi>E</mi></mrow><mrow><mi>q</m
   i><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow></mfenced></mrow></
   msub><mrow><mo>[</mo><mrow><mi>log</mi><mi>p</mi><mrow><mo>(</mo><mrow>
   <mi mathvariant="bold">x</mi><mo>∣</mo><mi
   mathvariant="bold">z</mi><mo>;</mo><mi>θ</mi></mrow><mo>)</mo></mrow></
   mrow><mo>]</mo></mrow><mo>−</mo><mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mrow><mo>(
   </mo><mrow><mi mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow><mo>)</mo></mrow><m
   o>∣</mo><mo>∣</mo><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">z</mi></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mro
   w> :MATH]
   4

   Here,
   [MATH: <mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mo>∣</mo><
   mo>∣</mo><mi>p</mi></mrow><mo>)</mo></mrow> :MATH]
   represents the Kullback–Leibler divergence between
   [MATH: <mi>q</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">z</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow></mfenced> :MATH]
   and prior
   [MATH: <mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">z</mi></mrow></mfenced> :MATH]
   . To maximize the log-likelihood, we can maximize the ELBO.

   To disentangle the latent variables and address batch effects, the
   learning objective for each omics layer represents as follows:
   [MATH:
   <mi>E</mi><mi>L</mi><mi>B</mi><mi>O</mi><mo>=</mo><msub><mrow><mi>E</mi
   ></mrow><mrow><mfenced close=")" open="("><mrow><mi>q</mi><mfenced
   close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>,</mo
   ><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>p</mi></mrow></msub><mo>∣</mo
   ><mi
   mathvariant="bold">x</mi><mo>;</mo><mi>ϕ</mi></mrow></mfenced></mrow></
   mfenced></mrow></msub><mrow><mo>[</mo><mrow><mi>log</mi><mi>p</mi><mrow
   ><mo>(</mo><mrow><mi mathvariant="bold">x</mi><mo>∣</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>,</mo
   ><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>p</mi></mrow></msub><mo>,</mo
   ><mi
   mathvariant="bold">c</mi><mo>;</mo><mi>θ</mi></mrow><mo>)</mo></mrow></
   mrow><mo>]</mo></mrow><mo>−</mo><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mrow><mo>(
   </mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>∣</mo
   ><mi
   mathvariant="bold">x</mi><mo>;</mo><msub><mrow><mi>ϕ</mi></mrow><mrow><
   mi>s</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>∣</mo><mo>∣</mo><mi>
   p</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub></mrow><m
   o>)</mo></mrow></mrow><mo>)</mo></mrow><mo>+</mo><mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mrow><mo>(
   </mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>p</mi></mrow></msub><mo>∣</mo
   ><mi
   mathvariant="bold">x</mi><mo>;</mo><msub><mrow><mi>ϕ</mi></mrow><mrow><
   mi>p</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>∣</mo><mo>∣</mo><mi>
   p</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>p</mi></mrow></msub></mrow><m
   o>)</mo></mrow></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mrow> :MATH]
   5

   Here,
   [MATH: <mi mathvariant="bold">c</mi> :MATH]
   is the batch label used as a conditional variable in the generative
   process.

   For the modality-shared representation, we use a product-of-experts
   (PoE) model to compute the joint variational posterior distribution for
   each omics modality:
   [MATH: <mi>q</mi><mfenced close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>∣</mo
   ><mi
   mathvariant="bold">x</mi></mrow></mfenced><mo>=</mo><mi>p</mi><mfenced
   close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub></mrow></
   mfenced><munderover accent="false"
   accentunder="false"><mrow><mo>∏</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><mi>q</mi><mfenced
   close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>s</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub><mo>∣</mo><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></
   mfenced> :MATH]
   6

   The prior
   [MATH: <mi>p</mi><mfenced close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub></mrow></
   mfenced> :MATH]
   is modeled as a Gaussian distribution
   [MATH: <mi>N</mi><mfenced close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mn>0</mn></mrow></msub><m
   o>,</mo><msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mn>0</mn></mrow></msub></
   mrow></mfenced> :MATH]
   , and each modality-specific posterior
   [MATH: <mi>q</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>x</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub><mo>∣</mo><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><m
   o>)</mo></mrow> :MATH]
   is also Gaussian, with mean
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><msub><mrow><mi>s</mi></mr
   ow><mrow><mi>i</mi></mrow></msub></mrow></msub> :MATH]
   and variance
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><msub><mrow><mi>s</mi></mr
   ow><mrow><mi>i</mi></mrow></msub></mrow></msub> :MATH]
   . The joint distribution is defined by:
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mi>s</mi></mrow></msub><m
   o>=</mo><msup><mrow><mfenced close=")"
   open="("><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mn>0</mn></mrow><mrow><mo
   >−</mo><mn>1</mn></mrow></msubsup><mo>+</mo><munder><mrow><mo>∑</mo></m
   row><mrow><mi>i</mi><mo>∈</mo><mfenced close="}"
   open="{"><mrow><mi>r</mi><mo>,</mo><mi>a</mi></mrow></mfenced></mrow></
   munder><msubsup><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >−</mo><mn>1</mn></mrow></msubsup></mrow></mfenced></mrow><mrow><mo>−</
   mo><mn>1</mn></mrow></msup> :MATH]
   7
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>s</mi></mrow></msub><m
   o>=</mo><mfenced close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mn>0</mn></mrow></msub><m
   subsup><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mn>0</mn></mrow><mrow><mo
   >−</mo><mn>1</mn></mrow></msubsup><mo>+</mo><munder><mrow><mo>∑</mo></m
   row><mrow><mi>i</mi><mo>∈</mo><mfenced close="}"
   open="{"><mrow><mi>r</mi><mo>,</mo><mi>a</mi></mrow></mfenced></mrow></
   munder><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   subsup><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >−</mo><mn>1</mn></mrow></msubsup></mrow></mfenced><msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mi>s</mi></mrow></msub>
   :MATH]
   8

   The overall learning objective for multi-modal integration during
   training is defined as:
   [MATH: <munder><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mi
   mathvariant="normal">ϵ</mi><mfenced close="}"
   open="{"><mrow><mi>r</mi><mo>,</mo><mi>a</mi></mrow></mfenced></mrow></
   munder><mrow><mo>(</mo><mrow><mrow><mo>[</mo><mrow><msub><mrow><mi>E</m
   i></mrow><mrow><mi>q</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>s</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub><mo>∣</mo><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>;</mo
   ><msub><mrow><mi>ϕ</mi></mrow><mrow><msub><mrow><mi>s</mi></mrow><mrow>
   <mi>i</mi></mrow></msub></mrow></msub></mrow><mo>)</mo></mrow><mo>,</mo
   ><mi>q</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub><mo>∣</mo><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>;</mo
   ><msub><mrow><mi>ϕ</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mrow>
   <mi>i</mi></mrow></msub></mrow></msub></mrow><mo>)</mo></mrow></mrow></
   msub><mrow><mo>[</mo><mrow><mi>log</mi><mi>p</mi><mrow><mo>(</mo><mrow>
   <msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>s</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub><mo>,</mo><mi
   mathvariant="bold">c</mi><mo>;</mo><msub><mrow><mi>θ</mi></mrow><mrow><
   mi>i</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mo>]</mo></mrow>
   <mo>−</mo><mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mrow><mo>(
   </mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>s</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub><mo>∣</mo><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>;</mo
   ><msub><mrow><mi>ϕ</mi></mrow><mrow><msub><mrow><mi>s</mi></mrow><mrow>
   <mi>i</mi></mrow></msub></mrow></msub></mrow><mo>)</mo></mrow><mo>∣</mo
   ><mo>∣</mo><mi>p</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>s</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub></mrow><mo>)</mo></mrow></mrow>
   <mo>)</mo></mrow><mo>−</mo><mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mrow><mo>(
   </mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub><mo>∣</mo><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>;</mo
   ><msub><mrow><mi>ϕ</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mrow>
   <mi>i</mi></mrow></msub></mrow></msub></mrow><mo>)</mo></mrow><mo>∣</mo
   ><mo>∣</mo><mrow><mo>(</mo><mrow><mspace
   width="0.25em"></mspace><mi>p</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub></mrow><mo>)</mo></mrow></mrow>
   <mo>)</mo></mrow></mrow><mo>)</mo></mrow></mrow><mo>]</mo></mrow></mrow
   ><mo>)</mo></mrow> :MATH]
   9

Cross mutual information Block

   To achieve a well-disentangled multi-modal latent space, we incorporate
   mutual information (MI) theory to decouple shared and private
   representations. The loss function for this block is defined as in Eq.
   ([184]10):
   [MATH:
   <msub><mrow><mi>L</mi></mrow><mrow><mi>C</mi><mi>I</mi><mi>D</mi></mrow
   ></msub><mo>=</mo><munder><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>∈<
   /mo><mfenced close="}"
   open="{"><mrow><mi>r</mi><mo>,</mo><mi>a</mi></mrow></mfenced></mrow></
   munder><mi>I</mi><mfenced close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>s</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub></mrow></mfenced><mo>+</mo><mi
   mathvariant="normal">I</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>r</mi></mrow></msub></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>a</mi></mrow></msub></mrow></msub></mrow></mfenced> :MATH]
   10

   Here,
   [MATH: <mi>I</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>s</mi></mrow><mro
   w><mi>i</mi></mrow></msub></mrow></msub></mrow><mo>)</mo></mrow> :MATH]
   measures the MI between the private and shared latent embeddings within
   a single modality
   [MATH: <mi>i</mi> :MATH]
   ,
   [MATH: <mi>I</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>r</mi></mrow></msub></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>a</mi></mrow></msub></mrow></msub></mrow><mo>)</mo></mrow> :MATH]
   measures the MI between private embeddings across modalities. The goal
   is to minimize the MI between modality-shared and private embeddings
   within a modality and between private embeddings across modalities. The
   MI between two variables
   [MATH: <mi>x</mi> :MATH]
   and
   [MATH: <mi>y</mi> :MATH]
   is defined as in Eq. ([185]11):
   [MATH: <mi>I</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>;</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced><mo>=</mo><msub><mrow><mi>E</
   mi></mrow><mrow><mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>,</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced></mrow></msub><mfenced
   close="]" open="["><mrow><mi>log</mi><mfrac><mrow><mi>p</mi><mfenced
   close=")" open="("><mrow><mi mathvariant="bold">x</mi><mo>,</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced></mrow><mrow><mi>p</mi><mfenc
   ed close=")" open="("><mrow><mi
   mathvariant="bold">x</mi></mrow></mfenced><mi>p</mi><mfenced close=")"
   open="("><mrow><mi
   mathvariant="bold">y</mi></mrow></mfenced></mrow></mfrac></mrow></mfenc
   ed> :MATH]
   11

   Since mutual information is generally intractable, we use an upper
   bound known as the Contrastive Log-ratio Upper Bound (CLUB). This
   approximation for
   [MATH:
   <msub><mrow><mi>I</mi></mrow><mrow><mi>C</mi><mi>L</mi><mi>U</mi><mi>B<
   /mi></mrow></msub><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>,</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced> :MATH]
   is described as in Eq. ([186]12):
   [MATH:
   <msub><mrow><mi>I</mi></mrow><mrow><mi>C</mi><mi>L</mi><mi>U</mi><mi>B<
   /mi></mrow></msub><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>,</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced><mo>=</mo><msub><mrow><mi>E</
   mi></mrow><mrow><mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>,</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced></mrow></msub><mfenced
   close="]" open="["><mrow><mi>log</mi><mi>p</mi><mfenced close=")"
   open="("><mrow><mi mathvariant="bold">y</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow></mfenced></mrow></mfenced><mo>−</mo><m
   sub><mrow><mi>E</mi></mrow><mrow><mi>p</mi><mfenced close=")"
   open="("><mrow><mi
   mathvariant="bold">x</mi></mrow></mfenced></mrow></msub><msub><mrow><mi
   >E</mi></mrow><mrow><mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">y</mi></mrow></mfenced></mrow></msub><mfenced
   close="]" open="["><mrow><mi>log</mi><mi>p</mi><mfenced close=")"
   open="("><mrow><mi mathvariant="bold">y</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow></mfenced></mrow></mfenced> :MATH]
   12

   To validate that
   [MATH:
   <msub><mrow><mi>I</mi></mrow><mrow><mi>C</mi><mi>L</mi><mi>U</mi><mi>B<
   /mi></mrow></msub><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>,</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced> :MATH]
   is an upper bound of
   [MATH: <mi>I</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>;</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced> :MATH]
   , we compute the difference
   [MATH: <mi
   mathvariant="normal">Δ</mi><mo>=</mo><msub><mrow><mi>I</mi></mrow><mrow
   ><mi>C</mi><mi>L</mi><mi>U</mi><mi>B</mi></mrow></msub><mo>−</mo><mi>I<
   /mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>;</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced> :MATH]
   as in Eq. ([187]13):
   [MATH: <mi
   mathvariant="normal">Δ</mi><mo>=</mo><msub><mrow><mi>E</mi></mrow><mrow
   ><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">y</mi></mrow><mo>)</mo></mrow></mrow></msub><mrow><m
   o>[</mo><mrow><mi>log</mi><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">y</mi></mrow><mo>)</mo></mrow><mo>−</mo><msub><mrow>
   <mi>E</mi></mrow><mrow><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">x</mi></mrow><mo>)</mo></mrow></mrow></msub><mrow><m
   o>[</mo><mrow><mi>log</mi><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">y</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow><mo>)</mo></mrow></mrow><mo>]</mo></mro
   w></mrow><mo>]</mo></mrow> :MATH]
   13

   Using the marginal distribution
   [MATH: <mi>log</mi><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">y</mi></mrow><mo>)</mo></mrow><mo>=</mo><mi>log</mi>
   <mo>∫</mo><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">y</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow><mo>)</mo></mrow><mi>p</mi><mrow><mo>(<
   /mo><mrow><mi
   mathvariant="bold">x</mi></mrow><mo>)</mo></mrow><mi>d</mi><mi
   mathvariant="bold">x</mi><mo>=</mo><mi>log</mi><mrow><mo>(</mo><mrow><m
   sub><mrow><mi>E</mi></mrow><mrow><mi>p</mi><mfenced close=")"
   open="("><mrow><mi
   mathvariant="bold">x</mi></mrow></mfenced></mrow></msub><mrow><mo>[</mo
   ><mrow><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">y</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow><mo>)</mo></mrow></mrow><mo>]</mo></mro
   w></mrow><mo>)</mo></mrow> :MATH]
   and Jensen’s inequality, we derive:
   [MATH: <mi>log</mi><mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">y</mi></mrow></mfenced><mo>=</mo><mi>log</mi><mfence
   d close=")"
   open="("><mrow><msub><mrow><mi>E</mi></mrow><mrow><mi>p</mi><mfenced
   close=")" open="("><mrow><mi
   mathvariant="bold">x</mi></mrow></mfenced></mrow></msub><mfenced
   close="]" open="["><mrow><mi>p</mi><mfenced close=")"
   open="("><mrow><mi mathvariant="bold">y</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow></mfenced></mrow></mfenced></mrow></mfe
   nced><mo>≥</mo><msub><mrow><mi>E</mi></mrow><mrow><mi>p</mi><mfenced
   close=")" open="("><mrow><mi
   mathvariant="bold">x</mi></mrow></mfenced></mrow></msub><mfenced
   close="]" open="["><mrow><mi>log</mi><mi>p</mi><mfenced close=")"
   open="("><mrow><mi mathvariant="bold">y</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow></mfenced></mrow></mfenced> :MATH]
   14

   This shows that
   [MATH: <mi mathvariant="normal">Δ</mi><mo>≥</mo><mn>0</mn> :MATH]
   , confirming that
   [MATH:
   <msub><mrow><mi>I</mi></mrow><mrow><mi>C</mi><mi>L</mi><mi>U</mi><mi>B<
   /mi></mrow></msub><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>,</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced> :MATH]
   is indeed an upper bound of
   [MATH: <mi>I</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">x</mi><mo>;</mo><mi
   mathvariant="bold">y</mi></mrow></mfenced> :MATH]
   which can be computed as:
   [MATH:
   <msub><mrow><mi>I</mi></mrow><mrow><mi>C</mi><mi>L</mi><mi>U</mi><mi>B<
   /mi></mrow></msub><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mi>N</
   mi></mrow></mfrac><munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><mi>log</mi><mi>p</m
   i><mfenced close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold">y</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></
   mfenced><mo>−</mo><mfrac><mrow><mn>1</mn></mrow><mrow><msup><mrow><mi>N
   </mi></mrow><mrow><mn>2</mn></mrow></msup></mrow></mfrac><munderover
   accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><munderover
   accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><mi>log</mi><mi>p</m
   i><mfenced close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold">y</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></
   mfenced> :MATH]
   15

   This simplifies to:
   [MATH:
   <msub><mrow><mi>I</mi></mrow><mrow><mi>C</mi><mi>L</mi><mi>U</mi><mi>B<
   /mi></mrow></msub><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><msup><
   mrow><mi>N</mi></mrow><mrow><mn>2</mn></mrow></msup></mrow></mfrac><mun
   derover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><munderover
   accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><mfenced close="]"
   open="["><mrow><mi>log</mi><mi>p</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi
   mathvariant="bold">y</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></
   mfenced><mo>−</mo><mi>log</mi><mi>p</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi
   mathvariant="bold">y</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></
   mfenced></mrow></mfenced> :MATH]
   16

   We employ a nonlinear neural network to estimate
   [MATH: <mi>p</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="bold">y</mi><mo>∣</mo><mi
   mathvariant="bold">x</mi></mrow></mfenced> :MATH]
   , with the log-likelihood computed as:
   [MATH: <munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><mi>log</mi><mi>p</m
   i><mfenced close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold">y</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>;</mo
   ><mi>ϕ</mi></mrow></mfenced> :MATH]
   17

   During each training epoch, the parameters of the distribution
   estimator are first updated to maximize Eq. ([188]17), improving the
   accuracy of conditional distribution estimation. Subsequently, other
   parameters of scTFBridge are updated to minimize
   [MATH:
   <msub><mrow><mi>I</mi></mrow><mrow><mi>C</mi><mi>L</mi><mi>U</mi><mi>B<
   /mi></mrow></msub> :MATH]
   , thereby reducing the mutual information between the two variables.

TF-Binding informed ATAC-seq Decoder

   To ensure that the joint shared embedding represents transcription
   factor (TF) regulatory activities, we incorporated prior knowledge of
   TF-motif binding scores into the architecture of the ATAC-seq decoder.
   We utilized 713 TF position weight matrices (PWMs) for known motifs,
   sourced from the PECA2 GitHub repository
   ([189]https://github.com/SUwonglab/PECA). This repository aggregates
   motif data from widely used databases, including JASPAR^[190]68,
   TRANSFAC^[191]69, and UniPROBE^[192]70. From the single-cell RNA-seq
   dataset, the 128 most highly expressed TFs were selected as dimensions
   of the modality-shared embedding. Each TF corresponds to a single
   latent variable in the shared embedding. Regulatory elements (REs) were
   scanned for TF binding motifs using HOMER. This produced a binding
   score matrix
   [MATH: <mi>B</mi> :MATH]
   , which quantifies the interaction strength between each TF and the
   REs.

   The ATAC-seq reconstruction decoder is implemented as a single-layer
   neural network. The decoder receives the combined latent variables as
   input and reconstructs ATAC-seq signals (peaks) constrained by
   biological priors. The binding score matrix
   [MATH: <mi>B</mi> :MATH]
   serves as a mask to constrain the weight matrix
   [MATH: <mi>W</mi> :MATH]
   of the decoder.
   [MATH: <mi>B</mi> :MATH]
   ensures that connections from TF latent embeddings to ATAC-seq peaks
   reflect known TF-RE interactions. For each connection in the decoder,
   the weight matrix
   [MATH: <mi>W</mi> :MATH]
   is modulated by the binding score matrix
   [MATH: <mi>B</mi> :MATH]
   , as follows:
   [MATH:
   <msub><mrow><mi>W</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow
   ></msub><mo>=</mo><msub><mrow><mi>W</mi></mrow><mrow><mi>i</mi><mo>,</m
   o><mi>j</mi></mrow></msub><mo>*</mo><msub><mrow><mi>B</mi></mrow><mrow>
   <mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub><mo>,</mo><mi>i</mi><mo>∈</
   mo><mfenced close="}"
   open="{"><mrow><mn>0</mn><mo>,</mo><mo>…</mo><mi>N</mi><mfenced
   open="]"><mrow></mrow></mfenced></mrow></mfenced><mo>,</mo><mi>j</mi><m
   o>∈</mo><mfenced close="}"
   open="{"><mrow><mn>0</mn><mo>,</mo><mo>…</mo><mi>K</mi><mfenced
   open="]"><mrow></mrow></mfenced></mrow></mfenced> :MATH]
   18

   Where
   [MATH: <mi>N</mi> :MATH]
   is the number of TF dimensions and
   [MATH: <mi>K</mi> :MATH]
   is the number of REs in ATAC-seq.

Cross modal inference

   In addition to modality-private self-reconstruction, the scTFBridge
   training process incorporates cross-modal generation to learn
   cis-regulatory relationships between ATAC-seq and RNA-seq. This feature
   enables inference across modalities, allowing the model to predict
   missing data in one modality using information from the other. When
   RNA-seq data is missing, the Product-of-Experts (PoE) approach is
   employed to use the shared latent distribution of ATAC-seq
   [MATH: <mi>q</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>s</mi></mrow><mro
   w><mi>a</mi></mrow></msub></mrow></msub><mo>∣</mo><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>a</mi></mrow></msub><mo>,</mo
   ><mi
   mathvariant="bold">c</mi><mo>;</mo><mi>ϕ</mi></mrow><mo>)</mo></mrow>
   :MATH]
   as the multi-modal joint distribution for cross-omics generation. The
   missing modality-private latent variable
   [MATH: <msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>r</mi></mrow></msub></mrow></msub> :MATH]
   is reparametrized from a prior normal Gaussian distribution
   [MATH: <mi>N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>r</mi></mrow></msub></mrow></msub><mo>∣</mo><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mn>0</mn></mrow></msub><m
   o>,</mo><msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mn>0</mn></mrow></msub></
   mrow><mo>)</mo></mrow> :MATH]
   . Using (
   [MATH: <msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>r</mi></mrow></msub></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub> :MATH]
   ), the model recovers the RNA-seq data by optimizing the following
   objective:
   [MATH: <msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>r</mi></mrow></msub></mrow></msub><mo>~</mo><mi>N</mi><mfenced
   close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mn>0</mn></mrow></msub><m
   o>,</mo><msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mn>0</mn></mrow></msub></
   mrow></mfenced><mo>,</mo><mi>q</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>a</mi></mrow></msub><mo>;</mo
   ><mi>ϕ</mi></mrow></mfenced></mrow></msub><mrow><mo>[</mo><mrow><mi>log
   </mi><mi>p</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>r</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>r</mi></mrow></msub></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>,</mo
   ><mi
   mathvariant="bold">c</mi><mo>;</mo><mi>θ</mi></mrow><mo>)</mo></mrow></
   mrow><mo>]</mo></mrow><mo>−</mo><mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mrow><mo>(
   </mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>a</mi></mrow></msub><mo>;</mo
   ><mi>ϕ</mi></mrow><mo>)</mo></mrow><mo>∣</mo><mo>∣</mo><mi>p</mi><mrow>
   <mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub></mrow><m
   o>)</mo></mrow></mrow><mo>)</mo></mrow> :MATH]
   19

   Similarly, we can derive the objective function for generating ATAC-seq
   data from RNA-seq:
   [MATH: <msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>a</mi></mrow></msub></mrow></msub><mo>~</mo><mi>N</mi><mfenced
   close=")" open="("><mrow><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mn>0</mn></mrow></msub><m
   o>,</mo><msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mn>0</mn></mrow></msub></
   mrow></mfenced><mo>,</mo><mi>q</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>r</mi></mrow></msub><mo>;</mo
   ><mi>ϕ</mi></mrow></mfenced></mrow></msub><mrow><mo>[</mo><mrow><mi>log
   </mi><mi>p</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>a</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mro
   w><mi>a</mi></mrow></msub></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>,</mo
   ><mi
   mathvariant="bold">c</mi><mo>;</mo><mi>θ</mi></mrow><mo>)</mo></mrow></
   mrow><mo>]</mo></mrow><mo>−</mo><mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mrow><mo>(
   </mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>∣</mo
   ><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>r</mi></mrow></msub><mo>;</mo
   ><mi>ϕ</mi></mrow><mo>)</mo></mrow><mo>∣</mo><mo>∣</mo><mi>p</mi><mrow>
   <mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>s</mi></mrow></msub></mrow><m
   o>)</mo></mrow></mrow><mo>)</mo></mrow> :MATH]
   20

Learning Object

   In general, the final optimizing objective of scTFBridge can be
   expressed as follows:
   [MATH:
   <mi>L</mi><mo>=</mo><msub><mrow><mi>L</mi></mrow><mrow><mi>V</mi><mi>A<
   /mi><mi>E</mi></mrow></msub><mo>+</mo><mi>β</mi><mo>*</mo><msub><mrow><
   mi>L</mi></mrow><mrow><mi>M</mi><mi>I</mi></mrow></msub><mo>+</mo><mi>γ
   </mi><mo>*</mo><msub><mrow><mi>L</mi></mrow><mrow><mi>c</mi><mi>r</mi><
   mi>o</mi><mi>s</mi><mi>s</mi><mspace
   width="0.16em"></mspace><mi>m</mi><mi>o</mi><mi>d</mi><mi>a</mi><mi>l</
   mi></mrow></msub> :MATH]
   21

   The first term optimizes the disentangled VAE for learning
   omics-specific representations and accurate data reconstruction, it
   measures how well the model reconstructs each modality from the latent
   representations and regularizes the latent space by minimizing the
   divergence between the latent distributions and the prior Gaussian
   distribution. The second term evaluates and promotes the
   disentanglement of the shared and private latent representations by
   minimizing the MI between them. The third term ensures that the model
   can perform effective cross-modal inference, using the shared and
   reparametrized latent variables. These terms collectively ensure that
   scTFBridge achieves biologically meaningful integration,
   disentanglement, and cross-omics prediction.

Single cell multi-omics data preprocessing

   In this study, we implemented a single-cell multi-omics data
   preprocessing workflow using the Scanpy package^[193]71. For the
   scRNA-seq data, TFs were isolated, normalized, and log-transformed. We
   then selected the top 128 highly variable TFs for focused embedding in
   the shared TF space. Additionally, the top 3000 highly expressed genes
   were identified as (TGs). These TGs were normalized and log-transformed
   to serve as input for scTFBridge. For scATAC-seq data, chromatin
   accessibility peaks were binarized and filtered to retain peaks with
   open accessibility in at least 1% of all cells. To ensure accurate
   integration, only cells represent in both datasets were included in the
   analysis.

SHAP value for cell-level and cell-type-specific GRN inference

   SHAP values provide a robust framework for quantifying the
   contributions of features in deep learning models and are widely used
   for biologically interpretable biomarker discovery. Building on prior
   studies^[194]48, we employed the reconstruction loss between original
   and imputed TG expression values as the objective for computing SHAP
   values. for input features, focusing on their contribution to the
   reconstruction rather than direct expression effects. By focusing on
   the reconstruction error, scTFBridge highlights features that are
   challenging for the model to learn, revealing complex or subtle
   regulatory relationships in cell-level and cell-type-specific GRNs.

   For cis-regulatory contributions, we calculated regulatory elements
   (REs) associated with TGs via cross-omics generation. The
   cis-regulatory score for a RE-TG pair was defined as:
   [MATH:
   <mi>R</mi><mi>E</mi><mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi><msub><mrow
   ><mi>e</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub><mo
   >=</mo><msub><mrow><mi>ρ</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi
   ></mrow></msub><msub><mrow><mi>O</mi></mrow><mrow><mi>j</mi></mrow></ms
   ub><msub><mrow><mi>E</mi></mrow><mrow><mi>i</mi></mrow></msub><msup><mr
   ow><mi>e</mi></mrow><mrow><mo>−</mo><mfrac><mrow><msub><mrow><mi>d</mi>
   </mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub></mrow><mrow><
   msub><mrow><mi>d</mi></mrow><mrow><mn>0</mn></mrow></msub></mrow></mfra
   c></mrow></msup> :MATH]
   22

   Where
   [MATH:
   <msub><mrow><mi>ρ</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow
   ></msub> :MATH]
   is the SHAP value of RE
   [MATH: <mi>j</mi> :MATH]
   and TG
   [MATH: <mi>i</mi> :MATH]
   ;
   [MATH: <msub><mrow><mi>O</mi></mrow><mrow><mi>j</mi></mrow></msub>
   :MATH]
   is the average chromatin accessibility of this cell type;
   [MATH: <msub><mrow><mi>E</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   is the average expression of TG
   [MATH: <mi>i</mi> :MATH]
   in the cell type;
   [MATH:
   <msub><mrow><mi>d</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow
   ></msub> :MATH]
   is the distance between genomic locations of
   [MATH: <mi>j</mi> :MATH]
   and
   [MATH: <mi>i</mi> :MATH]
   ; and
   [MATH: <msub><mrow><mi>d</mi></mrow><mrow><mn>0</mn></mrow></msub>
   :MATH]
   is the normalization factor for genomic distance, set to 30,000 bp in
   this study.

   For trans-regulatory contributions, we quantified the influence of each
   TF variable in the shared embedding space on TGs. The trans-regulatory
   score was directly defined as the SHAP value:
   [MATH:
   <msub><mrow><mi>T</mi><mi>F</mi><mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi
   ><mi>e</mi></mrow><mrow><mi>k</mi><mo>,</mo><mi>j</mi></mrow></msub><mo
   >=</mo><msub><mrow><mi>α</mi></mrow><mrow><mi>k</mi><mo>,</mo><mi>i</mi
   ></mrow></msub> :MATH]
   23

   Where
   [MATH:
   <msub><mrow><mi>α</mi></mrow><mrow><mi>k</mi><mo>,</mo><mi>i</mi></mrow
   ></msub> :MATH]
   represent the SHAP value quantifying the contribution of TF
   [MATH: <mi>k</mi> :MATH]
   to TG
   [MATH: <mi>i</mi> :MATH]
   .

   To infer regulatory networks at the individual cell level, we extended
   the computation of SHAP values to account for unique cell-level
   characteristics. Instead of averaging values across a specific cell
   type, we calculated SHAP values using the average gene expression and
   chromatin accessibility across all cell types. This method enabled the
   reconstruction of cell-specific GRNs, providing a fine-grained view of
   regulatory interactions that account for each cell’s unique features.

Latent embedding contributions estimation

   To evaluate the contributions of modality-private and shared components
   in the generative process of RNA-seq data from latent variables, we
   employed SHAP values. For each latent dimension, the average of the
   absolute SHAP values was computed across all TGs to derive global
   contribution scores. The overall contributions of the private and
   shared embedding spaces were then estimated by summing the contribution
   scores of all latent variables within their respective spaces. To
   ensure comparability, the two contribution scores (private and shared)
   were normalized to the range [0,1] by dividing each score by the
   combined sum of both scores. These normalized proportions were then
   ranked to determine the relative influence of private and shared
   components on gene regulation.

   Using these normalized contribution proportions, we distinguished genes
   primarily regulated by private information from those predominantly
   influenced by shared features. To investigate the biological
   significance of these findings, we performed Gene Ontology biological
   process (GO:BP) enrichment analysis using the gprofiler2^[195]72
   website with selected top 100 ranked genes from each group.
   Additionally, we conducted Gene Set Enrichment Analysis (GSEA) using
   the gesapy package^[196]73 with the ranked gene lists.

Benchmarking analysis of GRN inference

   To evaluate the performance of scTFBridge in predicting cis- and
   trans-regulation, we conducted a benchmarking analysis against
   established methods using consistent dataset splits to ensure a fair
   comparison. For cis-regulation prediction, we compared scTFBridge with
   PCC, TSS distance, random prediction, and SCENIC + ^[197]37. For
   trans-regulation prediction, we evaluated scTFBridge against baseline
   methods for cell-type-specific GRN inference from single-cell
   multi-omics data: LINGER^[198]33, scMTNI^[199]35,
   scMultiomeGRN^[200]34, and SCENIC+^[201]37. To address variations in
   gene filtering criteria across methods, we used the intersection of
   genes retained by all methods for final metric calculations, ensuring a
   fair and consistent evaluation. For GRN inference, performance was
   assessed by comparing the top-k predicted target genes of specific TF
   from each method against the intersection of target genes in the
   Cistrome database^[202]57. The overlapping genes and their associated
   scores were used for performance evaluation.

Heritability enrichment analysis

   We utilized stratified linkage disequilibrium score regression
   (S-LDSC)^[203]53 to assess the heritability enrichment of various peak
   sets. First, we converted the genomic coordinates of peaks from hg38 to
   hg19 to align with the GWAS summary statistics of RA. SNPs with a minor
   allele frequency (MAF) ≥ 0.05 associated with the peak sets of interest
   were selected. Each 200 bp peak window was extended by 500 bp on either
   side, resulting in a 1.2 kb genomic interval per peak. Using S-LDSC, we
   then calculated population-specific linkage disequilibrium (LD) scores
   and quantified the proportion of RA heritability attributable to
   modality-shared and modality-private peak sets from different
   single-cell populations in the RA dataset, as analyzed using
   scTFBridge.

Evaluation metric

AUPR Ratio

   The AUPR Ratio measures the improvement of a predictor over random
   chance in terms of precision-recall performance. It is defined as the
   ratio of the Area Under the Precision-Recall (AUPR) curve of a method
   to that of a random predictor. For a random predictor, the AUPR is
   equal to the fraction of positive samples in the dataset
   [MATH: <mi>P</mi> :MATH]
   , which is defined as:
   [MATH: <mi>A</mi><mi>U</mi><mi>P</mi><mi>R</mi><mspace
   width="0.16em"></mspace><mi>R</mi><mi>a</mi><mi>t</mi><mi>i</mi><mi>o</
   mi><mo>=</mo><mfrac><mrow><msub><mrow><mi>A</mi><mi>U</mi><mi>P</mi><mi
   >R</mi></mrow><mrow><mi>m</mi><mi>e</mi><mi>t</mi><mi>h</mi><mi>o</mi><
   mi>d</mi></mrow></msub></mrow><mrow><mi>P</mi></mrow></mfrac> :MATH]
   24

AUC

   The AUC score evaluates the performance of a predictor in
   distinguishing between positive and negative samples. It is derived
   from the Receiver Operating Characteristic (ROC) curve, which plots the
   true positive rate (sensitivity) against the false positive rate
   (1-specificity) across thresholds.

F1 Score

   The F1 Score assesses the balance between precision and recall,
   particularly useful when dealing with imbalanced datasets. It is
   computed as the harmonic mean of precision and recall, which is defined
   as:
   [MATH: <mi>F</mi><mn>1</mn><mspace
   width="0.25em"></mspace><mi>S</mi><mi>c</mi><mi>o</mi><mi>r</mi><mi>e</
   mi><mo>=</mo><mn>2</mn><mo>⋅</mo><mfrac><mrow><mi>P</mi><mi>r</mi><mi>e
   </mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo>⋅<
   /mo><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi></mrow>
   <mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><mi>n</mi><mo>+</mo><mi>R</mi><mi>e</mi><mi>c</mi><mi>a<
   /mi><mi>l</mi><mi>l</mi></mrow></mfrac> :MATH]
   25

Implementation details of scTFBridge

   We implemented scTFBridge using the PyTorch deep learning library in
   Python. The dimensions of modality-shared and private embedding were
   empirically set to 128 in scTFBridge. The model was trained with the
   Adam optimizer, employing an initial learning rate of 1e-3 and weight
   decay of 1e-5, reduced by epoch. Model training was conducted with a
   batch size of 128 over 150 epochs with early stopping triggered after
   10 steps of no improvement. Hyperparameters used to balance the weights
   of each training objective, with example value was provided in
   Supplementary Data [204]8. To ensure robustness, the datasets were
   divided into five folds for cross-validation, with each fold split into
   64% for training, 16% for validation and 20% for testing. The held-out
   validation set was used to identify the best model checkpoint in early
   stopping and optimize hyperparameters. The SHAP values for GRN
   inference were computed using the test datasets. All experiments were
   conducted on a workstation equipped with two Intel Xeon Silver 4210 R
   CPUs and eight NVIDIA GeForce RTX 4090 GPUs.

Reporting summary

   Further information on research design is available in the [205]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [206]Supplementary Information^ (2.8MB, pdf)
   [207]Supplementary Data 1^ (9.1KB, xlsx)
   [208]Supplementary Data 2^ (9.4KB, xlsx)
   [209]Supplementary Data 3^ (100.3KB, xlsx)
   [210]Supplementary Data 4^ (9.3KB, xlsx)
   [211]Supplementary Data 5^ (9.1KB, xlsx)
   [212]Supplementary Data 6^ (9.3KB, xlsx)
   [213]Supplementary Data 7^ (15.8KB, xlsx)
   [214]Supplementary Data 8^ (9.3KB, xlsx)
   [215]Supplementary Data 9^ (9.2KB, xlsx)
   [216]Supplementary Data 10^ (8.9KB, xlsx)
   [217]Reporting Summary^ (265.9KB, pdf)
   [218]41467_2025_64227_MOESM13_ESM.pdf^ (189.6KB, pdf)

   Description of Additional Supplementary Files
   [219]Transparent Peer Review file^ (7.4MB, pdf)

Source data

   [220]Source Data^ (390.6KB, zip)

Acknowledgements