Graphical abstract

   graphic file with name fx1.jpg
   [34]Open in a new tab

Highlights

     * •
       moETM integrates multiple modalities to a low-dimensional
       representation
     * •
       moETM accurately achieves cross-omics imputation
     * •
       moETM-inferred topics can elucidate cell-type signatures and gene
       regulatory programs
     * •
       moETM reveals the molecular basis of disease severity on a COVID-19
       CITE-seq

Motivation

   Single-cell technologies are advancing to enable profiling of multiple
   modalities within individual cells. However, the sparsity and noise in
   the data poses challenges for analysis. Most existing computational
   methods for analyzing single-cell data are either limited to single
   modality or lack flexibility and interpretability. In this study, we
   introduce moETM, a unified deep learning model that integrates
   single-cell multi-omics data by projecting them onto a common topic
   mixture representation. Furthermore, moETM employs a linear decoder
   design, which facilitates the interpretability and the discovery of
   biologically significant patterns.
     __________________________________________________________________

   Zhou et al. develop moETM to integrate single-cell multi-omics data by
   leveraging product-of-experts framework to infer latent topics and
   linear decoders to learn multi-omics features. moETM aims to accomplish
   several objectives, including clustering and identifying sub-cell
   types, cross-omics imputation, and identifying cell-type signatures.

Introduction

   Multi-omics single-cell high-throughput sequencing technologies open up
   new opportunities to interrogate cell-type-specific gene regulatory
   programs. Single-cell RNA sequencing (scRNA-seq) combined with assay
   for transposase-accessible chromatin using sequencing (ATAC-seq)[35]^1
   simultaneously measure the transcriptome and chromatin accessibility in
   the same cell. Cellular indexing of transcriptomes and epitopes by
   sequencing (CITE-seq)[36]^2 measures surface protein and transcriptome
   data using oligonucleotide-labeled antibodies. By integrating the
   information from these multiple omics, we can expand our understanding
   of the genome regulation from multiple perspectives.

   However, extracting meaningful biological patterns from the
   fast-growing multi-omics single-cell data remains a challenge due to
   several factors.[37]^3^,[38]^4 Firstly, the cell yield of multi-omics
   single-cell technologies is lower compared with the single-omics
   technologies such as scRNA-seq. On the other hand, the combined feature
   dimension is much higher (e.g., genes and peaks). This requires a more
   deliberate model design that can flexibly distill meaningful cell-type
   signatures from the multi-modal data while not overfitting the data.
   Secondly, multi-omics single-cell data are noisier compared with
   bulk-level or single-cell single-omics data. This calls for a
   probabilistic model that can infer latent cell types while properly
   accounting for the statistical uncertainty. Thirdly, the batch effects
   make it challenging to distinguish biological signals from
   study-specific confounders. Finally, multi-omics single-cell data are
   more costly compared with scRNA-seq or scATAC-seq alone. It is
   therefore highly cost-effective if we can profile single-omics data and
   then predict the unobserved omics data.

   Recently, several computational methods were developed to tackle the
   above multi-modality data-integration challenges encountered in
   multi-omics single-cell data analysis. For instance, SMILE[39]^5
   integrates multi-omics data by minimizing the mutual information of the
   latent representations among the modalities and batches. The
   totalVI[40]^6 and multiVI[41]^7 integrate CITE-seq data and scRNA +
   scATAC data via variational autoencoder (VAE) frameworks, respectively.
   Cobolt[42]^8 is a hierarchical Bayesian generative model to integrate
   cell modalities. scMM[43]^9 is a mixture-of-experts (MoE) model
   developed to impute one missing modality conditioned on the other.
   Multigrate[44]^10 adopted a product-of-experts framework to integrate
   multi-omics data. MOFA+[45]^11 uses mean-field variational Bayes and
   coordinate ascent to fit a Bayesian group factor analysis model to
   integrate the multi-omics data. Seurat V4[46]^12 integrated multimodal
   single-cell data through the weighted nearest neighbor algorithm. While
   many of these methods conferred promising performances in some of the
   tasks such as cell clustering or modality imputation, they often need
   to compromise scalability, interpretability, and/or flexibility. In
   particular, when a neural network is used to encode the
   high-dimensional multi-omics data, interpretability is traded for
   flexibility; when a linear model or independent feature assumption is
   made, flexibility is traded for interpretability and scalability.
   However, all three are important to reveal cell-type-specific
   multi-omics signatures that are indicative of gene regulatory programs
   from large-scale data. Furthermore, most of these methods are entirely
   data driven and therefore incapable of fully utilizing the existing
   biological information such as gene annotations or pathway information.

   In this study, we present a multi-omics embedded topic model (moETM) to
   integrate multiple molecular modalities at the single-cell level. As
   one of the main technical contributions, moETM uses product-of-experts
   to infer latent topics underlying the single-cell multi-omics data and
   a set of linear decoders to learn shared embedding of topics and
   multi-omics features that can accurately reconstruct the
   high-dimensional multi-omics data from their low-dimensional latent
   topic space ([47]Figure 1A). Through effectively integrating multiple
   modalities from multi-omics single-cell sequencing data, moETM seeks to
   achieve three tasks: (1) clustering cells into biologically meaningful
   clusters to identify the sub-cell type indicative of phenotype of
   interests ([48]Figure 1B), (2) imputing one omics using the other omics
   ([49]Figure 1C), and (3) identifying cell-type signatures, which serve
   as biomarkers for a target phenotype ([50]Figure 1D). Through
   comprehensive experiments on seven single-cell multi-omics datasets, we
   demonstrate moETM’s ability comparatively with six state-of-the-art
   computational methods. We further showcase how moETM facilitates the
   analysis of the COVID-19 single-cell CITE-seq dataset. Quantitatively,
   we observe that moETM learns the joint embeddings from multiple
   modalities with better or comparable bio-conservation, batch effect
   correction, and crossmodality imputation compared with the existing
   methods.[51]^5^,[52]^6^,[53]^7^,[54]^8^,[55]^9^,[56]^11^,[57]^12
   Furthermore, the topic embedding learned by moETM can be used to gain
   biological insights into the cell-type-specific multi-omics regulatory
   elements.

Figure 1.

   [58]Figure 1
   [59]Open in a new tab

   moETM model overview

   (A) Modeling single-cell multi-omics data across batches. For details,
   see STAR Methods.

   (B) Evaluating moETM through cell clustering. The integration
   performance of moETM is evaluated by clustering cells based on their
   topic proportion and qualitatively evaluated by UMAP visualization.

   (C) Cross-modality imputation.

   (D) Downstream topic analysis. The learned topics-by-{cells, genes,
   proteins, peaks} matrices enable identifying cell-type-specific topics,
   gene signatures, surface protein signatures, and regulatory network
   motifs, respectively.

Results

moETM model overview

   As an overview, moETM integrates multi-omics data across different
   experiments or studies with interpretable latent embeddings
   ([60]Figure 1). It is built upon the widely used VAE[61]^13 to model
   multi-modal data ([62]Figure 1A). However, to tailor the VAE framework
   for the single-cell multi-omics data, we made two main contributions on
   both the encoder and the decoder of the VAE.

   The encoder in moETM is a two-layer fully connected neural network,
   which infers topic proportion from multi-omics normalized count vectors
   for a cell. We assume the latent representation of each omics follows a
   K-dimensional independent logistic normal distribution. Our goal is to
   effectively combine these distributions into a joint distribution of
   the multi-omics data. To this end, we take the product of the
   K-dimensional Gaussians, i.e., product-of-Gaussians (PoG). Because the
   PoG is also a Gaussian density function, we can represent the joint
   latent distribution in closed form. In principle, this results in a
   tighter evidence lower bound (ELBO) and therefore more efficient
   variational inference compared with the MoE approaches[63]^14 as
   adopted in MultiVI/TotalVI[64]^6^,[65]^7 and scMM.[66]^9 In particular,
   these MoE approaches sample K-dimensional Gaussian variables for each
   omics and then take their average. In contrast, our PoG formalism
   requires sampling only once from the joint Gaussian. Therefore, moETM
   may confer more robust estimates thanks to the reduced sampling noise
   from the Monte Carlo approximation procedure. We perform a softmax
   transformation on the joint Gaussian density. The resulting logistic
   normal distribution can be considered as a topic mixture membership for
   the cell. These topics can be directly mapped to known cell types based
   on their top gene signatures detected from our linear decoder as the
   topic distribution must sum to 1 over the K topics; the inferred topic
   mixture membership of a cell expresses statistical uncertainty in the
   cell embedding.

   On the decoder side, inherited from our earlier work,[67]^15 moETM
   employs a linear matrix factorization to reconstruct the normalized
   count vectors from the cell embedding. Our working hypothesis is that
   the encoder creates a linearly separable space for the decoder to
   achieve a good reconstruction when the two networks are trained
   end-to-end. Specifically, the decoder factorizes the cell-by-feature
   matrices into a shared cell-by-topic matrix
   [MATH: <mrow><mi mathvariant="bold">Θ</mi></mrow> :MATH]
   , a shared topic-embedding matrix
   [MATH: <mrow><mi mathvariant="bold">α</mi></mrow> :MATH]
   , and M separate feature-embedding matrices
   [MATH: <mrow><msup><mi
   mathvariant="bold">ρ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup></mrow> :MATH]
   , where
   [MATH: <mrow><mi>m</mi><mo linebreak="goodbreak"
   linebreakstyle="after">∈</mo><mrow><mo>{</mo><mrow><mn>1</mn><mo>,</mo>
   <mo>…</mo><mo>,</mo><mi>M</mi></mrow><mo>}</mo></mrow></mrow> :MATH]
   indexes the omics. Since different omics share the same cells-by-topics
   matrix but have their own feature-embedding matrices, we can explore
   the relations among cells, topics, and features in a highly
   interpretable way. This departs from the existing VAE models such as
   scMM,[68]^9 BABEL,[69]^16 and Multigrate[70]^10 that used a neural
   network as the decoder. Another main challenge in single-cell data
   analysis is the batch effects, which are sources of technical
   variation. To account for those, we introduced the omics-specific batch
   removal factors
   [MATH: <mrow><msup><mi
   mathvariant="bold">λ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup><mo linebreak="goodbreak" linebreakstyle="after">∈</mo><msup><mi
   mathvariant="double-struck">R</mi><mrow><msup><mi>V</mi><mrow><mo>(</mo
   ><mi>m</mi><mo>)</mo></mrow></msup><mo>×</mo><mi>S</mi></mrow></msup></
   mrow> :MATH]
   for each omics m ([71]Figure 1A), which act as linear-additive
   batch-specific biases in reconstructing each modality. By regressing
   out the batch effects via
   [MATH: <mrow><msup><mi mathvariant="bold">λ</mi><mrow><mo>(</mo><mo
   linebreak="badbreak">·</mo><mo>)</mo></mrow></msup></mrow> :MATH]
   , moETM can learn biologically meaningful representations in terms of
   the cell topic mixture and the topic/feature embedding. As detailed in
   [72]STAR Methods, all the parameters in moETM are learned end-to-end by
   maximizing a common objective function defined as the ELBO of the
   marginal data likelihood under the framework of amortized variational
   inference.

Multi-omics integration

   We performed quantitative evaluations of moETM on the integrated
   low-dimensional representation compared with six state-of-the-art
   multi-omics integration methods (SMILE,[73]^5 scMM,[74]^9 Cobolt,[75]^8
   MultiVI/TotalVI,[76]^6^,[77]^7 MOFA+,[78]^11 and Seurat V4[79]^12) on
   seven published datasets. Four out of the seven datasets are
   single-cell transcriptome and chromatin accessibility (gene + peak)
   datasets and the other three are single-cell transcriptome and surface
   protein expression (gene/transcript + protein) datasets measured by
   CITE-seq.

   The performance of the multi-omics integrative task was based on both
   biological conservation metrics and batch removal metrics (STAR
   Methods). For the biological conservation score, we adopted the common
   metrics including Adjusted Rand Index (ARI)[80]^17 and Normalized
   Mutual Information (NMI[81]^18). For evaluating batch effect removal,
   we used k-nearest neighbor batch effect test (kBET)[82]^19 with graph
   connectivity (GC).

   To make a comprehensive comparison, we used three experimental
   settings: (1) 60/40 random split for training and testing with 500
   repeats ([83]Tables S1 and [84]S2), (2) training and testing both on
   the whole dataset ([85]Table S3), and (3) training and testing across
   different batches ([86]Tables S4 and [87]S5). The number of topics was
   set to 100 during the training based on the robust performance
   ([88]Figure S1). Overall, we obtained consistent results across all
   three settings and therefore chose to focus on describing the results
   based on the first setting.

   We observed that moETM achieved the best overall performance when
   averaging over all or gene + protein datasets’ performance scores among
   three out of four evaluation metrics ([89]Figure 2). It conferred the
   second-highest averaged kBET, which is only marginally behind
   multiVI/totalVI. The latter two methods might have over-corrected the
   batch effect at the expense of biological conservation. When averaging
   over gene + peak datasets, moETM can still achieve the best performance
   among two out of four evaluation metrics. Specifically, moETM ranked
   the second highest on ARI and NMI and slightly behind Seurat V4, which
   has a larger standard deviation compared with moETM.

Figure 2.

   [90]Figure 2
   [91]Open in a new tab

   Methods comparison based on cell clustering

   The left column illustrates the individual performance of each method
   on each dataset. The seven datasets are indicated on the x axis with
   gene + peak datasets colored in blue and gene + protein datasets
   colored in black. Within each dataset, the highest value was labeled on
   the top of the corresponding bar. The middle column is the comparison
   of averaging values across datasets for each method. The right column
   is the comparison between moETM and its three ablated versions. Each
   row represents an evaluation metric.

   (A) Adjusted Rand Index (ARI).

   (B) Normalized Mutual Information (NMI).

   (C) k-Nearest neighbor batch effect test (kBET).

   (D) Graph connectivity (GC).

   For individual datasets, moETM is either the best or the second best
   method on six out of seven datasets (except MBC) for different
   experimental settings in terms of the ARI ([92]Figure 2A; [93]Tables S1
   and [94]S2). One possible reason could be that the sample size of MBC
   (3,293 cells) from which moETM learns high-dimensional peak embeddings
   is small compared with the other 6 datasets. To assess the benefits of
   the added features in moETM, we compared moETM with its ablated
   versions (moETM_rna, moETM_atac, and moETM_protein), where moETM was
   trained on a single omic. As expected, the performance of moETM on
   single modality decreased, indicating that moETM could improve its
   performance by leveraging multiple modalities ([95]Figure 2, right
   panel).

   Similar quantitative conclusions can be drawn based on NMI
   ([96]Figure 2B; [97]Tables S1 and [98]S2). For kBET ([99]Figure 2C),
   moETM is the best for the BMMC1 dataset and the second best on the
   other datasets—slightly behind MultiVI/TotalVI. Therefore, while moETM
   conferred higher biological conservation scores in terms of ARI and
   NMI, it still maintains comparable kBET scores on all four datasets
   compared with MultiVI/TotalVI. Indeed, we observed an excellent balance
   between the biological conservation and batch effect removal because
   moETM achieved notably higher GC compared with all other methods
   ([100]Figure 2D). This is because GC is the only metric that is based
   on both cell types and batch labels by measuring the similarity among
   cells of the same type from different batches based on the embedding
   learned by each method.[101]^20

   We postulated that the main reason for moETM’s superior integration
   performance is its PoG formulation. To that end, we constructed
   moETM_avg, which replaced PoG with averaging of sampled variables from
   individual Gaussian distributions similar to the existing VAE models
   such as scMM.[102]^9 As expected, the performance of moETM_avg was
   worse than moETM in all datasets in terms of all evaluation metrics
   ([103]Figure 2, right panel). Furthermore, since scMM also adopted the
   average of the Gaussian samples from the encoder, the fact that
   moETM_avg outperformed scMM indicates the benefits of using the linear
   decoder, which further improves the multi-omics integration while
   correcting batch effects across all cells.

   We further verified the clustering performance by visualizing the cell
   topic mixture embeddings using Uniform Manifold Approximation and
   Projection (UMAP)[104]^21 ([105]Figure 3). Indeed, not only did moETM
   remove batch effects but also revealed a better representation of cell
   type clusters. For example, “Plasmablast IGKC–” cells were grouped
   closely by moETM but were clustered into multiple small parts by SMILE
   ([106]Figure 3A). Moreover, plasmablast cells from different batches
   were also mixed better by moETM compared with SMILE, which indicated a
   better batch effects correction. “CD4+ T activated” and “CD4+ T naïve”
   cells were closer within the same cluster but clearly distinguishable
   between themselves. In contrast, these two cell types were mixed
   together by SMILE and scMM. In modeling the BMMC1 dataset (gene +
   peak), “B1 B” cells and “naive CD20^+ B” cells ([107]Figure 3B) were
   mixed by other methods while better separated by moETM.

Figure 3.

   [108]Figure 3
   [109]Open in a new tab

   UMAP visualization of cell clustering

   (A) UMAP visualization of moETM, SMILE, and scMM on single-cell
   CITE-seq from BMMC2 dataset colored by batches (upper panel) and cell
   types (lower panel).

   (B) UMAP visualization of moETM, SMILE, and scMM on the gene + peak
   multi-omics data from the BMMC1 dataset.

   In addition, we visually compared the cell clustering by the individual
   modalities with that by the integrated modalities via UMAP
   ([110]Figure S2). Similarly, we also compared the cell clustering based
   on the individual omics of single-cell transcriptome and surface
   proteins with the clustering based on the integrated RNA + protein
   topic mixture using the BMMC2 dataset. We observed that the integrated
   topic representation led to a more coherent cell clustering compared
   with the topic mixture of individual omics ([111]Figure S2). For
   example, the “CD14^+ Mono” cells were grouped more closely by the
   integrated topic mixture compared with their unimodal counterparts
   ([112]Figure S2A; ARI: 0.735 by RNA + peak in contrast to 0.690 by RNA
   only and 0.648 by peak only). Similarly, the plasmablast IGKC– cells
   also formed tighter clusters in the integrated RNA + protein embedding
   space ([113]Figure S2B; ARI: 0.734 by RNA + protein in contrast to
   0.688 by transcriptome only and 0.590 by surface proteins only).

   Therefore, moETM was able to improve cell clustering by integrating
   multiple modalities. Taken together, these results show that moETM is
   able to distinguish similar cell types by capturing biological
   information in its encoding space while removing batch effects.

Cross-omics imputation

   In the case of gene + protein, moETM accurately imputes surface protein
   expression from gene expression for the BMMC2 dataset, achieving
   average Pearson (Spearman) correlation of 0.95, 0.92, and 0.88 (0.94,
   0.90, and 0.85) on random split, leave-one-batch, and
   leave-one-cell-type imputation experiments, respectively
   ([114]Table S6). We visualized the reconstructed protein expression
   against the observed values using the BMMC2 (gene + protein) dataset
   ([115]Figure 4A). The imputed protein expression is highly linearly
   correlated with the observed one ([116]Figure 4B), which is what we
   expected given the high Pearson correlation of 0.95. The runner up
   methods—namely, scMM and BABEL—also performed well on this task, both
   achieving a correlation score of 0.94.

Figure 4.

   [117]Figure 4
   [118]Open in a new tab

   Cross-omics imputation

   (A) Heatmap of original protein and imputed protein values from gene
   expression using the BMMC2 CITE-seq dataset. The column and row orders
   are the same for the two heatmaps.

   (B) Scatterplot of original and imputed surface protein expression. The
   diagonal line is in blue color.

   (C and D) Heatmap and scatterplot of the original and imputed gene
   expression from chromatin accessibility on the BMMC1 dataset. For more
   results, see [119]Figure S3.

   Compared with the surface protein imputation task, imputing gene
   expression from the open chromatin regions is a more challenging task
   because of the sparser input scATAC-seq signals and the dynamic and
   often asynchronous interplay between the chromatin states and the
   transcriptome.[120]^22^,[121]^23^,[122]^24 Nonetheless, moETM achieved
   relatively high Pearson (and Spearman) correlation scores of 0.69,
   0.65, and 0.58 (and 0.37, 0.35, and 0.32) on random split,
   leave-one-batch, and leave-one-cell-type experiments. These are notably
   higher than the corresponding correlation obtained by BABEL (Pearson:
   0.65, 0.60, 0.55; Spearman: 0.34, 0.33, 0.30) and scMM (Pearson: 0.63,
   0.61, 0.54; Spearman: 0.33, 0.33, 0.28) ([123]Table S6). Qualitatively,
   the imputed and the observed gene expression profiles also exhibit
   similar pattern and linear relationship ([124]Figures 4C and 4D).

   In the previous two imputation applications, low-dimensional modalities
   were generated from high-dimensional modalities. The imputation from
   the low dimension to the high dimension is more difficult but
   nonetheless feasible. Specifically, on the three same experimental
   designs, the Pearson (and Spearman) correlations between the observed
   and the imputed open chromatin regions from gene expression are 0.58,
   0.55, and 0.51 (and 0.33, 0.30, and 0.28); the Pearson (and Spearman)
   correlation between the observed and imputed gene expression from
   protein expression are 0.65, 0.63, and 0.60 (and 0.41, 0.39, and 0.37)
   ([125]Table S6). In contrast, the runner-up method scMM achieved
   Pearson (and Spearman) correlations of 0.40, 0.29, and 0.37 (and 0.29,
   0.25, and 0.21) for imputing chromatin accessibility from gene
   expression. For imputing gene expression from surface protein, scMM and
   BABEL also fell behind moETM in terms of both Pearson and Spearman
   correlations ([126]Table S6). Qualitatively, the imputed and the
   observed peaks and gene expression exhibit consistent patterns
   ([127]Figures S3A and S3C) and strong linear trends ([128]Figure S3B
   and S3D).

Correlating RNA transcripts with surface proteins and in cis chromatin
accessibility regions

   As a proof-of-concept, we sought to assess whether the top surface
   proteins can be mapped to the top genes under the same topic (i.e.,
   following the central dogma). To this end, we trained a 100-topic moETM
   on the BMMC2 (gene + protein) dataset generated by CITE-seq over 90,000
   cells. For each topic, we calculated the Spearman correlation of topic
   scores between the 134 pairs of the gene transcripts and the
   corresponding translated surface proteins ([129]Figure 5A). The
   correlations ranged from −0.096 to 0.751 with an average of 0.29. In
   particular, 96 of the 100 topics have positive correlations. Among
   them, 13 topics have correlations larger than 0.5.

Figure 5.

   [130]Figure 5
   [131]Open in a new tab

   Topic analysis of gene + protein CITE-seq data

   (A) Protein-RNA correlations and pathway enrichments for the 100 topics
   learned from the CITE-seq BMMC2 data. In the middle and the bottom
   panels, dots correspond to the tested immunologic signature gene sets
   from MSigDB. Different colors represent different gene sets.

   (B) Topics embedding of 10,000 sub-sampled cells from the BMMC2
   dataset. Only the topics with the sum of absolute values greater than
   the third quartile across all sampled cells were shown. The two color
   bars display two tiers of annotations for the 9 broad cell types
   (cellType1) and 45 fine-grained cell types (cellType2).

   (C) GSEA leading-edge analysis of topic 40. The left and right panel
   represent significantly enriched gene sets (q value
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after"><</mo></mrow> :MATH]
   0.001) based on gene topic scores and protein topic scores,
   respectively.

   (D) Genes and proteins signatures of the select topics. The upper and
   lower panels display the topics-by-genes and topics-by-proteins
   heatmap. The top genes and proteins that are known cell-type markers
   based on CellMarker or literature search are highlighted in blue.

   (E) UMAP visualization of the genes, proteins, and topics via their
   shared embedding space. The corresponding topic indices and
   gene/protein symbols were highlighted by corresponding colors.

   To further quantify the known transcript-protein as well as gene-peak
   regulatory relations, we computed their Spearman correlations across
   topics inferred from the BMMC datasets. We paired a peak with a gene if
   it was within 150k bp distance from the transcription start site of the
   gene. The overall distribution of the correlations for
   transcript-protein pairs and gene-peak pairs were both significantly
   higher than 0 (p
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after"><</mo></mrow> :MATH]
   2.2e−16; one-sample t test) and comparable with the correlations
   calculated directly from the observed data ([132]Figures S4A and S4C).

   Notably, 90% of transcript-protein pairs exhibited positive
   correlations and nine pairs displayed correlations exceeding 0.5
   ([133]Figure S4E). Nonetheless, several transcript-protein pairs
   exhibited low or negative correlations. Several factors could
   contribute to these low correlations. Firstly, random noise may hinder
   correlations between genes and proteins. Secondly, dynamic cellular
   processes at the single-cell level can give rise to variations between
   cells, leading to a decrease in correlations.[134]^25 For example,
   transcriptional bursting or delays between transcription and
   translation will lead to asynchronous behavior of gene and protein
   during the cell cycle, thereby reducing the correlations between gene
   and protein expression levels.[135]^25 Particularly, a number of
   transcript-protein pairs displayed negative correlations
   ([136]Figure S4E). This phenomenon has also been observed in previous
   studies.[137]^26^,[138]^27^,[139]^28 For instance, Li et al. reported a
   mismatch between mRNA and protein expression levels, including the
   CD69-CD69 pair.[140]^27 One possible cause might be due to the impact
   of other biological processes overriding the effects of
   transcription.[141]^28 Taking CD69-CD69 as an example, the CD69 gene
   may undergo post-translational modifications such as differential
   glycosylation.[142]^29 The transcribed CD69 mRNA molecules can be
   translated to a 22.5 kDa polypeptide, which can further be
   differentially glycosylated to two different subunits. These subunits
   can be randomly combined to form different receptors, leading to a
   reduction in the abundance of the CD69 protein.[143]^29 If the
   influence of post-translational modifications surpasses the impact of
   protein synthesis, it can give rise to a negative correlation.
   Furthermore, the CD69 mRNA transcripts are unstable. The level of CD69
   mRNA could decline rapidly with a half-life of less than
   60 min.[144]^29 While mRNA molecules degrade over time, protein levels
   may maintain relatively stable or continue to accumulate. If the rate
   of mRNA degradation surpasses that of protein synthesis, a negative
   correlation could emerge.

   In addition to investigating correlations across topics, we calculated
   correlations across cells by computing Spearman correlations in terms
   of observed values and reconstructed values based on the BMMC datasets
   ([145]Figures S4A–S4D). The correlations based on both the observed and
   reconstructed data across cells were significantly greater than 0,
   indicating consistent relations among transcript-protein and gene-peak
   pairs captured at the cell level. However, the correlations from the
   reconstructed values are higher than those from the observed values
   ([146]Figures S4B and S4D). This is because the observed values may
   contain random noise or batch effects compared with reconstructed
   values by moETM, which can be considered as denoised and
   confounder-corrected values of the gene/protein/peak signals.

Immune cell-type signatures revealed by multi-omics topics learned from
CITE-seq data

   To identify cell-type signatures, we associated each topic with the
   specific cell type that exhibit the highest average topic score across
   cells. Notably, not all topics were uniquely associated with one single
   cell type and some topics might be enriched for a combination of
   multiple cell types. Therefore, we chose to describe a selected subset
   of the topics based on their distinctly enriched cell types and heatmap
   visualization ([147]Figure 5B).

   For instance, topic 44 was associated with monocytes and consists of
   CD14^+ and CD16^+ Mono; topic 40 was associated with B cells and
   consists of primarily naive CD20^+ B IGKC^+ and naive CD20^+ B IGKC^–
   cells; topic 83 was associated with natural killer cells. These are
   visually easy to detect from the topic mixture probabilities among the
   individual cells ([148]Figure 5B).

   Under each cell-type-enriched topic, many top genes and top proteins
   are the known cell-type markers ([149]Figure 5D). For example, under
   topic 40 (i.e., a B cell topic), the top genes CR2, SSPN, and ADAM28
   are known marker genes for B cells; the top proteins CD21, CD20, and
   CD40 are also marker proteins for B cells according to the CellMarker
   database.[150]^30 For topic 7, one of top proteins CD11c is a marker
   protein for dendritic cells (DCs).[151]^30 For topic 83, protein CD16,
   marker for natural killer cells, is among its top proteins.[152]^30 For
   topic 44, the top gene S100A9’s coding protein is a chemotactic factor
   for monocytes[153]^31 and is highly expressed during inflammatory
   processes[154]^32; among the top proteins for topic 44, CD36,[155]^33
   CD33, and CD11c[156]^34 are also markers for monocyte sub-cell types.
   Similarly, the monocyte is also enriched in topic 23, which shares the
   top marker protein CD16 with topic 44 but also contains unique top
   genes such as CDKN1C and FCGR3A. While CDKN1C is a known marker gene
   for monocyte,[157]^35 FCGR3A is upregulated in CD16^+ monocytes as
   supported by the existing literature.[158]^36

   Moreover, we performed gene set enrichment analysis
   (GSEA)[159]^37^,[160]^38 using the topic scores for all of the genes
   and proteins. Because BMMCs are immune cells, we queried the
   C7 ImmuneSigDB from MSigDb, which is a collection of 5,219 gene sets
   related to immune pathways.[161]^39^,[162]^40^,[163]^41 Across all
   100 topics, we identified 2,569 enriched gene sets with q values
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after"><</mo></mrow> :MATH]
   0.05 using gene topic scores and 22 enriched gene sets using protein
   topic scores ([164]Figure 5A). For example, in topic 40, using the gene
   topic scores, we found a gene set that consists of upregulated genes in
   B cells with respect to monocytes[165]^42 ([166]Figure 5C, left panel);
   using the protein topic scores, we found a gene set that consists of
   upregulated genes in B cells compared with plasmacytoid DCs
   ([167]Figure 5C, right panel).[168]^42

   Furthermore, we projected the topic embeddings and feature embeddings
   onto a common 2D space using UMAP ([169]Figure 5E). We observed that
   the top marker genes and the top marker proteins for the cell type
   clustered together around the corresponding topics, implying a
   well-aligned embedding space within and across these modalities.
   Together, the results suggested that the cell-type-enriched topics
   inferred by moETM from the CITE-seq data reveal meaningful biological
   relations between genes and proteins.

Joint multi-omics topic analysis identified cell-type-specific pathways and
regulatory motifs

   The topic embedding learned from the scRNA + scATAC data enables us to
   investigate the relationship between top genes and top peaks in the
   cell-type-specific topics. Given that many top genes are known
   cell-type markers ([170]Figure 6A), we postulated that the top peaks
   could be associated with the top genes via in cis or in trans
   regulatory elements. One challenge in interpreting the gene + peak
   multi-omics topics is that peaks cannot be matched directly with genes.
   We proposed two approaches to solve this issue. One is to link peaks to
   their nearby genes to obtain the peak-neighboring genes (STAR Methods).
   The other approach is to identify enriched motifs among the top peaks
   and explore the relationship between genes and motifs via the
   corresponding transcription factors (TFs) and their target genes.

Figure 6.

   [171]Figure 6
   [172]Open in a new tab

   Topic analysis of single-cell gene + peak data from the BMMC1 dataset

   (A) Top genes and top peak-neighbor genes of the select topics.

   (B) Topic embedding of cells from the BMMC1 dataset.

   (C) GSEA leading edge analysis of topic 3. The left panel is the GSEA
   result using gene topic scores and the right panel is the GSEA result
   using peak-neighboring gene topic scores.

   (D) Pathway enrichment and motif enrichment for the 100 topics.

   (E) Topic-directed regulatory networks based on motif enrichment
   analysis. The yellow edges indicate known TF-target associations based
   on ENCODE TF Targets dataset.

   (F) Topic-directed regulatory networks incorporating enriched pathways,
   motifs, and top genes. For more results, see [173]Figure S5.

   For the first approach, the top genes and top peak-neighboring genes in
   the select topics served as markers for the cell-type-specific gene
   regulatory programs ([174]Figure 6A). For example, topic 32 is
   associated with CD8^+ T cells ([175]Figures 6A and 6B). We zoomed-in
   the topic by examining its top genes and top peaks. Three of the top 5
   genes (TNFRSF9, ASTL, GZMK, DUSP2, and DGKH) were related to T cells.
   In particular, GZMK is a marker gene for T cells based on
   CellMarker[176]^30; TNFRSF9 codes for a signaling protein that promotes
   expression of cytokines in CD8^+ T cells[177]^43; DUSP2 encodes an
   inducible nuclear protein and is highly expressed in T cells.[178]^44
   Among the top 5 peak-neighboring genes (APBA2, PRDX2, KLRC4, OBSCN, and
   XCL2), APBA2 is a marker gene for cytotoxic CD8^+ T
   lymphocytes[179]^45; XCL2 expression levels substantially increased in
   CD8^+ T cells during T cell activation.[180]^46 As another example,
   topic 3 is associated with CD4^+ T naive cells. Three out of the top 5
   genes (CCR4, ADAM12, PTPN13, MB21D2, and IL4I1) and two out of the top
   5 peak-neighboring genes (INPP4B, CCR4, PRDX2, RORA, and HIST1H2BD) are
   related to T cells. Indeed, CCR4 is shown to be specifically expressed
   among naive CD4^+ T cells[181]^47; ADAM12 is expressed in T cells in
   the inflamed brain and is a potential target for the treatment of
   Th1-mediated diseases[182]^48; IL4I1 increases the threshold of T cell
   activation and partially modulates CD4 T cell differentiation.[183]^49
   For top peak-neighboring genes, RORA is upregulated among the activated
   CD4^+ T cells.[184]^50

   To gain further mechanistic understanding of the inferred topics, we
   performed GSEA on the topic scores for the genes from the transcriptome
   modality and the topic scores for the peak-neighboring genes from the
   chromatin accessibility modality ([185]Figure 6D). Many enriched gene
   sets are related to the topic-associated cell types. For topic 3, for
   instance, one of the enriched gene sets based on the gene topic scores,
   is upregulated in healthy CD4 T cells compared with healthy myeloid
   cells[186]^51 ([187]Figure 6C). This is consistent with an enriched
   gene set from the peak-neighboring gene analysis of topic 3, where the
   gene set consists of a set of genes that were upregulated in naive CD4
   T cells relative to the DC.[188]^52 Therefore, GSEA further confirmed
   the cell-type-specific functions of the top genes and peak-neighboring
   genes identified via moETM’s topics. Interestingly, the top transcripts
   and the top peak-neighboring genes do not often correspond to the same
   genes. This implies that the peaks and genes provide complementary
   information to (sometimes the same) cell-type-specific regulatory
   programs. Therefore, by effectively integrating the scRNA-seq and
   scATAC-seq data, the inferred multi-omics topics can reveal functional
   convergence at the pathway level.

   Besides using peak-neighboring genes, as the second approach we also
   performed motif enrichment analysis on the top 100 peaks per topic
   (STAR Methods; [189]Figure 6D). We then constructed a putative
   regulatory network by linking the top genes and the enriched motifs via
   their associated topics ([190]Figures 6E and 6F). Interestingly, some
   of the top genes harbor those enriched motifs, implying that these
   genes are the putative target genes of the cognate TF. In topic 3, for
   example, one of the enriched motifs corresponds to a TF named FLI1 (p =
   0.00117), and the top genes IL4l1 and PTPN13 are target genes of FLI1
   based on the ENCODE Transcription Factor Targets.[191]^53^,[192]^54 As
   another example, one of the enriched motifs for topic 32 corresponds to
   TF MEF2A (p = 5.21e-5), whose target genes include the top genes RGS1,
   EGR1, GZMK, ASTL, and DUSP2.[193]^53^,[194]^54

   We further expanded our topic-network analysis by including enriched
   pathways and cell type information ([195]Figures 6E and [196]S5). We
   defined the intra-connections within the same topic as edges between
   the topic nodes and cell type nodes. We also established
   inter-connections between genes and external nodes including motifs and
   pathways. Specifically, the top genes under each topic could serve as
   members of enriched pathways or target genes for enriched motifs. For
   instance, in topic 32, gene DUSP2 is a target gene for enriched motifs
   MEF2A (p = 5.21e-5, permutation test) and PAX5 (p = 1.53e-5,
   permutation test), while also being a member gene in four enriched
   pathways and three of them are upregulated gene sets in the enriched
   cell type of T cells (UNSTIM_VS._ACD3_ACD28_STIM_WT_CD4_TCELL_DN,
   NKT_CELL_VS._ALPHAALPHA_CD8_TCELL_DN,
   UNSTIM_VS._ACD3_ACD28_STIM_WT_CD4_TCELL_DN). Similarly, in topic 30,
   gene FCAR is a target gene for one enriched motif CEBPB (p = 1.27e−5,
   permutation test) and a member of three enriched pathways, two of which
   are related to gene sets upregulated in the enriched cell type of
   monocyte (MONOCYTE_VS._MDC_UP,
   PBMC_MRKAD5_HIV_1_GAG_POL_NEF_AGE_20_50YO_1DY_UP,
   MONOCYTE_VS._MDC_DAY7_FLU_VACCINE_UP). Likewise, in topic 3, gene DPP4
   is a target gene for two enriched motifs PAX5 (p = 5.39e-4, permutation
   test) and SP1 (p = 1.13e-4, permutation test) and a member of 11
   enriched pathways where four of them (NAIVE_TCELL_VS._NKCELL_UP,
   NAIVE_TCELL_VS._MONOCYTE_UP, NAIVE_CD4_TCELL_VS._MONOCYTE_UP,
   NAIVE_CD4_TCELL_VS._DC_UP) are upregulated gene sets in CD4^+ T naive
   cells. Those connections highlighted a consistent regulatory
   relationship across motifs and pathways under inferred topics.

   Therefore, our multi-omics topic analysis suggests that some of the
   cell-type-specific regulatory programs are implicated with the sequence
   motifs and pathways. Further investigation is needed to establish the
   hierarchical relation between the TF and the cell lineage.

Prior pathway-informed enrichment

   The single-cell multi-omics data are high-dimensional, sparse, and
   noisy. This is especially the case for the scRNA + scATAC-seq data
   because of the large number of genes and open chromatin regions. One
   way to further improve the interpretability of the topics derived from
   these data is by incorporating prior knowledge such as gene sets or
   pathway information. In the context of our moETM, this was done by
   fixing the embeddings-by-genes parameters to the observed
   pathways-by-genes matrix (STAR Methods). Using the
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after">∼</mo></mrow> :MATH]
   7,000 gene ontology biological process terms as the pathways-by-genes
   matrix, we trained the pathway-informed moETM (p-moETM) on the BMMC1
   gene + peak dataset.

   Quantitatively, p-moETM can achieve comparable cell-clustering
   performance with an ARI of 0.72, which is only slightly lower than the
   default moETM that learned the gene embedding directly from the data
   ([197]Table S1). We also identified several cell-type-specific topics
   along with their top genes and peaks ([198]Figure S6C–S6H). Notably,
   the learned topics-by-embeddings matrix
   [MATH: <mrow><mi mathvariant="bold">α</mi></mrow> :MATH]
   from p-moETM is essentially the topics-by-pathways matrix. This allows
   us to directly identify the top pathways for each topic without
   performing post-hoc GSEA. For instance, topic 25 is associated with B1
   B cells ([199]Figure S6C). One of its top pathways is related to B cell
   activation ([200]Figure S6A). As another example, topic 8 was enriched
   for the CD4^+ T activated cells, and one of its top pathways was
   connected to the T cell apoptotic process.

   For some topics, their top genes are both the members of the pathway
   and the cell-type biomarkers ([201]Figure S6, left panel). For
   instance, topic 27 is enriched in the CD4^+ T naive cells. One of its
   top genes CCR7 is involved in the elimination process of immature
   T cells. In addition, topic 41 is enriched for the transitional B cell.
   Its top pathways include B cell activation and adaptive immune process.
   Among its top genes, TNFAIP3 is in the B cell activation-related
   pathway. One of its top peaks in chr14: 100207793–100208735 is upstream
   of the promoter of YY1 (chr14: 100238298–100282788), which is a gene
   member in the B cell activation-related pathway.[202]^55

   Furthermore, we experimented using a more specific gene set namely the
   immune signature gene set collections from MSigDB to investigate
   immune-related pathways implicated in the BMMC1 dataset
   ([203]Figure S6, right panel). We identified several cell-type-specific
   topics that exhibit high scores for meaningful immune pathways. For
   instance, topic 23 is enriched in naive CD20^+ B cells. Two of its top
   10 pathways are associated with naive B cells. One of its top genes
   namely HLA-DPB1 is upregulated in naive B cells relative to the plasma
   cells.[204]^56 One of the top peaks (chr12: 8886393–8887019) is
   upstream of PHC1 (chr12:8913896–8941467), which is also involved in the
   pathway where genes are upregulated in naive B cells relative to the
   plasma cells.[205]^56

Multi-omics topics reveal the molecular basis of COVID-19 severity

   As the CITE-seq technology interrogates the expression of surface
   proteins along with the full transcriptome, it is a promising platform
   to investigate the immune responses among patients infected by the
   SARS-CoV-2 virus (COVID-19). Using moETM, we sought to identify
   clinically relevant molecular signatures from a COVID-19 CITE-seq
   dataset (HBIC).[206]^57 The data consist of 781,123 cells from 130
   COVID-19 patients with varying degrees of severity due to the viral
   infection. To establish model confidence, we first performed a
   quantitative analysis as above. The results showed that moETM could
   achieve either the highest or the second-highest evaluation metrics
   both in bio-conservation and batch removal cases ([207]Table S2). In
   particular, moETM ranked first with an ARI value of 0.752. Similarly,
   moETM and TotalVI attained the highest NMI scores of 0.779 and 0.762,
   respectively. Both methods also maintained their top performance in
   terms of batch correction with TotalVI achieving the highest kBET of
   0.197, while moETM came in second with 0.153. Consistent to the above
   evaluation ([208]Table S2), moETM obtained the best GC score of 0.950,
   whereas TotalVI achieved the second best of 0.934. Therefore, these
   quantitative results on the COVID-19 data further suggest that moETM
   strikes a good balance between biological conservation and batch effect
   correction.

   Qualitatively, we investigated the top features and identified enriched
   cell types under each topic ([209]Figures S7A and S7B). In particular,
   topic 42 is enriched for B cells. Among its top 5 genes (SLC38A11,
   TCL1B, IL6, TCL1A, SYN3), IL6 and TCL1A are the known marker genes.
   Also, three out of its top 5 proteins (CD19, CR1, CD22, FCGR2A, and
   BAFFR) are marker proteins for B cells. Topic 31 is associated with
   platelets. Two out of its top 5 genes (LYVE1, RADIL, VWF,[210]^58
   TRHDE, and PPBP) are marker genes, namely VWF and PPBP, and one of its
   top 5 proteins (ITGA2B, KIR3DL1, ITGAX, SELP, and FCGR2A) is a marker
   protein (i.e., ITGA2B) for platelets. In addition, a previous study has
   suggested that SELP redistributes to the plasma membrane during
   platelet activation.[211]^59 The enriched pathways based on GSEA are
   consistent to the cell-type specificity of those topics
   ([212]Figure S7C). Taking topic 42 as an example, the enriched pathway
   is the gene set that is downregulated in CD4 T cells compared with B
   cells.[213]^51 Because of the shared embedding space, we also observed
   localization of the top genes and the top proteins for the selected
   topics via UMAP ([214]Figure S7D).

   We then leveraged the phenotype severity information among the patients
   to explore gene and protein signatures related to the COVID-19
   phenotypes. Specifically, we utilized COVID metadata information to
   test whether a topic is significantly over-represented for the severity
   conditions. Here, we considered each topic as a “meta-gene” and
   associated their upregulation or downregulation with the disease
   phenotypes ([215]Figures 7A and 7B). We delved into topics based on
   their distinctly enriched cell types, heatmap visualization
   ([216]Figure S7A), and differential analysis by the phenotypes (e.g.,
   COVID severity) (STAR Methods). For example, we observed that topic 42
   is not only enriched for B cell but also upregulated among patients
   with critical COVID status, whereas topic 80 is significantly
   associated with the severe status. Moreover, topic 42 is associated
   with other demographic features such as age and mainly enriched in the
   senior group between 70 and 79 years ([217]Figure 7A).

Figure 7.

   [218]Figure 7
   [219]Open in a new tab

   Topic association with the COVID-19 severity status

   (A) Differential analysis of severity states, sex, smoking history, and
   age. The color intensity values correspond to the differences of
   average topic scores between the positive cells and negative cells for
   each attribute and each topic. Asterisks indicate Bonferroni-adjusted p
   value
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after"><</mo></mrow> :MATH]
   0.001 based on one-sided t test of upregulated topics for each label.

   (B) Differential analysis of topics across cell types. The heatmap on
   the left displays the topic associations with each of the 18 cell
   types, and the one on the right associates the same topics with 6
   fine-grained B cell subtypes.

   (C) UMAP visualization of cell clustering. The right panel shows a
   zoom-in version of the B cell clustering with color indicating the six
   B cell subtypes.

   (D) UMAP visualization with cells colored by source subjects’ severity
   states due to COVID-19 infection.

   (E) Normalized gene expression of IL6 among the cells on the same UMAP.

   Given its disease relevance, we further investigated topic 42 to see
   whether it elucidates more granular cell types and to some extent
   whether their top gene/protein signatures can serve as putative
   biomarkers for COVID critical conditions. First of all, the
   moETM-inferred cell topic embeddings did not only cluster cells into
   their primary cell types but also sub-divided B cells into six
   sub-clusters of known sub-cell types ([220]Figure 7C and zoom-in view).
   Intriguingly, aligning the COVID phenotypes with the B cell sub-types
   revealed that the critical COVID condition corresponded to B malignant
   cells ([221]Figures 7C and 7D). B cell lymphomas start to develop when
   B lymphocytes, which are in charge of humoral immunity, start to
   proliferate beyond control. This proliferation turns B cells into
   malignant cells.[222]^60 The previous study[223]^61 suggested that
   individuals with certain cancers, such as lymphoma, may be more
   susceptible to getting severe illness from COVID-19. The top gene IL6
   in topic 42 was consistently expressed at a high level among B cells,
   including although not specifically in B malignant cells
   ([224]Figure 7E). IL6 levels were commonly reported in severely ill
   patients due to COVID-19.[225]^62^,[226]^63 As another example, topic
   21 is also enriched in B malignant cells ([227]Figure 7B). One of its
   top proteins CD5 ([228]Figure S7B) was shown to be highly expressed on
   malignant cells.[229]^64 Moreover, the previous study[230]^65 suggested
   that the proportion of CD5^+ B cells was significantly reduced in COVID
   patients. Taken together, our results suggest that IL6 or CD5 may be a
   potential therapeutic target.

Discussion

   Gene regulatory programs involve multi-faceted regulation and cannot be
   understood via a single-omics approach alone. As these technologies
   continue to evolve, computational methods are needed to account for the
   challenges in modeling the sparse, noisy, and heterogeneous nature of
   data that are being generated at a rapid pace.[231]^3 In this study, we
   developed a unified interpretable deep learning model called moETM to
   integrate single-cell multi-omics data including transcriptome and
   chromatin accessibility or surface proteins, which are the most common
   types of single-cell multi-omics data to date.[232]^4

   Our technical contributions are 3-fold. First, via the
   product-of-experts, moETM effectively integrates multiple omics by
   projecting them onto a common topic mixture representation. Second, the
   linear decoder enables the extraction of multi-omics signatures as the
   top features under each latent topic, which directly reveal marker
   genes and phenotype markers under topics that are aligned with cell
   types or phenotype conditions. Third, by efficiently correcting batch
   effects via a dedicated linear intercept matrix in the decoder, we can
   integrate multi-omics data from multiple studies, subjects, or
   technologies, which allows us to exploit the vast amount of multi-omics
   data to obtain biologically diverse and coherent multi-omics topics.

   To demonstrate the utility of moETM, we benchmarked it with six
   existing state-of-the-art computational methods on seven published
   datasets including four gene + peak datasets and three gene + protein
   datasets ([233]Tables S1 and [234]S2). Across all datasets, moETM
   achieved competitive performance based on four common evaluation
   metrics. We also confirmed the advantage of using multiple modalities
   compared with single modality in terms of cell clustering. Moreover,
   because of its joint modeling capabilities, the trained moETM can
   accomplish this cross-omics imputation task. In both imputation
   directions, moETM achieved a higher correlation than scMM and
   BABEL. Although more challenging, moETM also achieved a reasonable
   performance when imputing high dimensions from low dimensions.

   We also explored the moETM-learned cell-type-specific topics in terms
   of their top omics features and enriched pathways in the light of the
   supporting evidence from the literature. moETM is able to detect immune
   cell-type signatures and identify cell-type-specific pathways and
   regulatory motifs. In a more focused study, we analyzed the COVID-19
   CITE-seq dataset (gene + protein) and linked moETM-learned
   immune-specific topics with patient severity conditions due to the
   infection. Our topic analysis revealed not only immune marker genes but
   also cell types that are associated with COVID phenotype conditions.

Limitations of the study

   There are several challenges that are not addressed in moETM.[235]^4
   For instance, moETM has the capacity to integrate across multiple
   batches and modalities, but it requires the training data to have all
   omics measured in the same cells. A more challenging task is to
   integrate multimodal data without anchored features or cells, which is
   commonly known as the diagonal integration.[236]^4 Some recent
   approaches made use of graph representation learning to integrate
   multi-omics single-cell data at the expense of computational complexity
   and interpretability.[237]^66^,[238]^67^,[239]^68

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Deposited data
     __________________________________________________________________

   BMMC CITE-seq Luecken et al.[240]^69 [241]GSE194122
   BMMC Multiome data Luecken et al.[242]^69 [243]GSE194122
   Mouse skin cell SHARE-seq Ma et al.[244]^24 [245]GSE140203
   Mouse brain cell SHARE-seq Ma et al.[246]^24 [247]GSE140203
   Mouse kidney cell sci-CAR Cao et al.[248]^70 [249]GSE117089
   PBMC CITE-seq Hao et al.[250]^12 [251]GSE164378
   COVID-19 PBMC CITE-seq Stephenson et al.[252]^57
   [253]https://www.covid19cellatlas.org/
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   moETM This paper
   [254]https://github.com/manqizhou/moETM[255]https://doi.org/10.5281/zen
   odo.8104798
   scanpy Wolf et al.[256]^71 [257]https://github.com/scverse/scanpy
   anndata Virshup et al.[258]^72 [259]https://github.com/scverse/anndata
   biomaRt Durinck et al.[260]^73 [261]https://rdrr.io/bioc/biomaRt/man/
   seurat Hao et al.[262]^12 [263]https://satijalab.org/seurat/
   totalVI Gayoso et al.[264]^6
   [265]https://github.com/YosefLab/scvi-tools
   SMILE Xu et al.[266]^5 [267]https://github.com/rpmccordlab/SMILE
   scMM Minoura et al.[268]^9 [269]https://github.com/kodaim1115/scMM
   Cobolt Gong et al.[270]^8 [271]https://github.com/epurdom/cobolt
   MultiVI Ashuach et al.[272]^7 [273]https://zenodo.org/record/5762077
   MOFA+ Argelaguet et al.[274]^11 [275]https://github.com/bioFAM/MOFA2
   [276]Open in a new tab

Resource availability

Lead contact

   Further information and requests for resources should be directed to
   and will be fulfilled by the lead contact, Yue Li
   ([277]yueli@cs.mcgill.ca).

Materials availability

   This study did not generate new unique reagents.

Method details

moETM data generative process

   The molecular activities in each cell n can be measured with M omics,
   such as gene expression from transcriptome, surface protein expression,
   and the open chromatin regions manifested as peaks. For the ease of the
   following descriptions, we define the entities of genes, proteins and
   peaks as “features”. Profiling those omics in the cell leads to M count
   vectors
   [MATH: <mrow><msubsup><mrow><mo stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup></mrow>
   :MATH]
   , each of which has a dimension
   [MATH:
   <mrow><msup><mi>V</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup
   ></mrow> :MATH]
   as the number of unique features in omic m. Adapting the text-mining
   analogy, we consider each cell as a “document” written in M languages
   or modalities (i.e., transcriptome, proteome, chromatin accessibility);
   each feature from the
   [MATH:
   <mrow><msup><mi>m</mi><mrow><mi>t</mi><mi>h</mi></mrow></msup></mrow>
   :MATH]
   omic is considered as a “word” from the
   [MATH:
   <mrow><msup><mi>m</mi><mrow><mi>t</mi><mi>h</mi></mrow></msup></mrow>
   :MATH]
   vocabulary; each sequencing read is a “token” in the document; the
   abundance of the reads mapped to the same feature is the “word count”
   in the document.

   The multi-modal document of a cell n can be summarized into a mixture
   of K latent topics
   [MATH: <mrow><msub><mi
   mathvariant="bold">θ</mi><mi>n</mi></msub></mrow> :MATH]
   , which are presumably implicated in each modality ([278]Figure 1A).
   Inference of these topic mixtures for each cell is accomplished by
   modeling the distribution of the multi-omic count data
   [MATH: <mrow><msubsup><mrow><mo stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup></mrow>
   :MATH]
   from the topic mixture for the cell and learning the global topic
   embedding over the M modalities. The latter are shared among all cells
   and expressed as M matrices
   [MATH: <mrow><msubsup><mrow><mo stretchy="true">{</mo><mrow><msup><mi
   mathvariant="bold">Φ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup><mo linebreak="badbreak">∈</mo><msup><mi
   mathvariant="double-struck">R</mi><mrow><mi>K</mi><mo
   linebreak="badbreak">×</mo><msup><mi>V</mi><mrow><mo>(</mo><mi>m</mi><m
   o>)</mo></mrow></msup></mrow></msup></mrow><mo
   stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup></mrow>
   :MATH]
   , where a row vector
   [MATH: <mrow><msubsup><mi
   mathvariant="bold">φ</mi><mi>k</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo linebreak="goodbreak"
   linebreakstyle="after">∈</mo><msup><mi
   mathvariant="double-struck">R</mi><msup><mi>V</mi><mrow><mo>(</mo><mi>m
   </mi><mo>)</mo></mrow></msup></msup></mrow> :MATH]
   denotes the k-th topic from the m-th modality.

   To increase information sharing across the omics and the model
   expressiveness, we further decompose each omic-specific topic embedding
   matrix
   [MATH: <mrow><msup><mi
   mathvariant="bold">Φ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup></mrow> :MATH]
   into the topic embedding
   [MATH: <mrow><mi mathvariant="bold">α</mi><mo linebreak="goodbreak"
   linebreakstyle="after">∈</mo><msup><mi
   mathvariant="double-struck">R</mi><mrow><mi>K</mi><mo>×</mo><mi>L</mi><
   /mrow></msup></mrow> :MATH]
   and feature embedding
   [MATH: <mrow><msup><mi
   mathvariant="bold">ρ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup><mo linebreak="goodbreak" linebreakstyle="after">∈</mo><msup><mi
   mathvariant="double-struck">R</mi><mrow><mi>L</mi><mo>×</mo><msup><mi>V
   </mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow></msup></
   mrow> :MATH]
   , where L denotes the size of the embedding space. The expected values
   for the count data for each omic is proportional to the dot product of
   the cell embedding, topic embedding matrix, and feature embedding
   matrix:
   [MATH: <mrow><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo linebreak="goodbreak"
   linebreakstyle="after">∝</mo><msub><mi
   mathvariant="bold">θ</mi><mi>n</mi></msub><mi
   mathvariant="bold">α</mi><msup><mi
   mathvariant="bold">ρ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup></mrow> :MATH]
   .

   Formally, we formulate the data generative process as follows. For each
   cell indexed by
   [MATH: <mrow><mi>n</mi><mo linebreak="goodbreak"
   linebreakstyle="after">∈</mo><mrow><mo>{</mo><mrow><mn>1</mn><mo>,</mo>
   <mo>…</mo><mo>,</mo><mi>N</mi></mrow><mo>}</mo></mrow></mrow> :MATH]
   , draw a
   [MATH: <mrow><mn>1</mn><mo linebreak="goodbreak"
   linebreakstyle="after">×</mo><mi>K</mi></mrow> :MATH]
   topic proportion
   [MATH: <mrow><msub><mi
   mathvariant="bold">θ</mi><mi>n</mi></msub></mrow> :MATH]
   from logistic normal distribution
   [MATH: <mrow><msub><mi mathvariant="bold">θ</mi><mi>n</mi></msub><mo
   linebreak="goodbreak" linebreakstyle="after">∼</mo><mi
   mathvariant="script">LN</mi><mrow><mo>(</mo><mrow><mn>0</mn><mo>,</mo><
   mi mathvariant="bold">I</mi></mrow><mo>)</mo></mrow></mrow> :MATH]
   :
   [MATH: <mrow><msub><mi>δ</mi><mi mathvariant="normal">n</mi></msub><mo
   linebreak="badbreak">∼</mo><mi
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><mn>0</mn><mo>,</mo><m
   i
   mathvariant="normal">I</mi></mrow><mo>)</mo></mrow><mo>,</mo><msub><mi>
   θ</mi><mi mathvariant="normal">n</mi></msub><mo
   linebreak="badbreak">=</mo><mtext>softmax</mtext><mrow><mo
   stretchy="true">(</mo><msub><mi>δ</mi><mi
   mathvariant="normal">n</mi></msub><mo stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">=</mo><mfrac><mrow><mi>exp</mi><mspace
   width="0.25em"></mspace><mrow><mo
   stretchy="true">(</mo><msub><mi>δ</mi><mrow><mi
   mathvariant="normal">n</mi><mo>,</mo><mi
   mathvariant="normal">k</mi></mrow></msub><mo
   stretchy="true">)</mo></mrow></mrow><mrow><msubsup><mo>∑</mo><mrow><mi
   mathvariant="normal">k</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi
   mathvariant="normal">K</mi></msubsup><mspace
   width="0.25em"></mspace><mi>exp</mi><mspace
   width="0.25em"></mspace><mrow><mo
   stretchy="true">(</mo><msub><mi>δ</mi><mrow><mi
   mathvariant="normal">n</mi><mo>,</mo><mi
   mathvariant="normal">k</mi></mrow></msub><mo
   stretchy="true">)</mo></mrow></mrow></mfrac><mtext>.</mtext></mrow>
   :MATH]
   (Equation 1)

   For each read
   [MATH:
   <mrow><msup><mi>i</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup
   ><mo linebreak="goodbreak" linebreakstyle="after">∈</mo><mrow><mo
   stretchy="true">{</mo><mrow><mn>1</mn><mo>,</mo><mo>…</mo><mo>,</mo><ms
   ubsup><mi>D</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></
   msubsup></mrow><mo stretchy="true">}</mo></mrow></mrow> :MATH]
   from the
   [MATH:
   <mrow><msup><mi>m</mi><mrow><mi>t</mi><mi>h</mi></mrow></msup></mrow>
   :MATH]
   modality
   [MATH:
   <mrow><msubsup><mi>w</mi><mrow><mi>n</mi><mo>,</mo><msup><mi>i</mi><mro
   w><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow><mrow><mo>(</mo><m
   i>m</mi><mo>)</mo></mrow></msubsup></mrow> :MATH]
   , draw a feature index
   [MATH:
   <mrow><msup><mi>v</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup
   ></mrow> :MATH]
   (e.g., the particular transcript or open chromatin region the read was
   sequenced) from a categorical distribution
   [MATH: <mrow><mtext>Cat</mtext><mrow><mo
   stretchy="true">(</mo><msubsup><mi
   mathvariant="bold">r</mi><mrow><mi>n</mi><mo>,</mo><mo>.</mo></mrow><mr
   ow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   :
   [MATH:
   <mrow><msubsup><mi>w</mi><mrow><mi>n</mi><mo>,</mo><msup><mi>i</mi><mro
   w><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow><mrow><mo>(</mo><m
   i>m</mi><mo>)</mo></mrow></msubsup><mo
   linebreak="goodbreak">∼</mo><mrow><munderover><mo>∏</mo><mrow><msup><mi
   >v</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><msup><mi>V</mi><mrow><mo>(
   </mo><mi>m</mi><mo>)</mo></mrow></msup></munderover><msup><mrow><mo
   stretchy="true">[</mo><msubsup><mi>r</mi><mrow><mi>n</mi><mo>,</mo><mi>
   v</mi></mrow><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">]</mo></mrow><mrow><mo
   stretchy="true">[</mo><mrow><msubsup><mi>w</mi><mrow><mi>n</mi><mo>,</m
   o><msup><mi>i</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></m
   row><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup><mo
   linebreak="badbreak">=</mo><msup><mi>v</mi><mrow><mo>(</mo><mi>m</mi><m
   o>)</mo></mrow></msup></mrow><mo
   stretchy="true">]</mo></mrow></msup></mrow></mrow> :MATH]
   (Equation 2)

   where
   [MATH:
   <mrow><msubsup><mi>D</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup></mrow> :MATH]
   denotes the total number of reads. The expected rate
   [MATH:
   <mrow><msubsup><mi>r</mi><mrow><mi>n</mi><mo>,</mo><msup><mi>v</mi><mro
   w><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow><mrow><mo>(</mo><m
   i>m</mi><mo>)</mo></mrow></msubsup></mrow> :MATH]
   of observing feature
   [MATH:
   <mrow><msup><mi>v</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup
   ></mrow> :MATH]
   in cell n is parameterized as:
   [MATH:
   <mrow><msubsup><mi>r</mi><mrow><mi>n</mi><mo>,</mo><msup><mi>v</mi><mro
   w><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow><mrow><mo>(</mo><m
   i>m</mi><mo>)</mo></mrow></msubsup><mo
   linebreak="badbreak">=</mo><mfrac><mrow><mi>exp</mi><mspace
   width="0.25em"></mspace><mrow><mo stretchy="true">(</mo><msubsup><mover
   accent="true"><mi>r</mi><mo>ˆ</mo></mover><mrow><mi>n</mi><mo>,</mo><ms
   up><mi>v</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow><
   mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">)</mo></mrow></mrow><mrow><msubsup><mo>∑</mo><mrow><msu
   p><mi>v</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><msup><mi>V</mi><mrow><mo>(
   </mo><mi>m</mi><mo>)</mo></mrow></msup></msubsup><mspace
   width="0.25em"></mspace><mi>exp</mi><mspace
   width="0.25em"></mspace><mrow><mo stretchy="true">(</mo><msubsup><mover
   accent="true"><mi>r</mi><mo>ˆ</mo></mover><mrow><mi>n</mi><mo>,</mo><ms
   up><mi>v</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow><
   mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">)</mo></mrow></mrow></mfrac><mo>,</mo><msubsup><mover
   accent="true"><mi>r</mi><mo>ˆ</mo></mover><mrow><mi>n</mi><mo>,</mo><ms
   up><mi>v</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow><
   mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup><mo
   linebreak="badbreak">=</mo><msub><mi
   mathvariant="bold">θ</mi><mi>n</mi></msub><mi
   mathvariant="bold">α</mi><msubsup><mi
   mathvariant="bold">ρ</mi><mrow><mo>.</mo><mo>,</mo><msup><mi>v</mi><mro
   w><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow><mrow><mo>(</mo><m
   i>m</mi><mo>)</mo></mrow></msubsup><mo
   linebreak="goodbreak">+</mo><msubsup><mi>λ</mi><mrow><mi>s</mi><mrow><m
   o>(</mo><mi>n</mi><mo>)</mo></mrow><mo>,</mo><msup><mi>v</mi><mrow><mo>
   (</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow><mrow><mo>(</mo><mi>m</m
   i><mo>)</mo></mrow></msubsup></mrow> :MATH]
   (Equation 3)

   where
   [MATH: <mrow><msubsup><mi
   mathvariant="bold">ρ</mi><mrow><mo>.</mo><mo>,</mo><msup><mi>v</mi><mro
   w><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup></mrow><mrow><mo>(</mo><m
   i>m</mi><mo>)</mo></mrow></msubsup><mo linebreak="goodbreak"
   linebreakstyle="after">∈</mo><msup><mi
   mathvariant="double-struck">R</mi><mrow><mi>L</mi><mo>×</mo><mn>1</mn><
   /mrow></msup></mrow> :MATH]
   denotes embedding of feature
   [MATH:
   <mrow><msup><mi>v</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup
   ></mrow> :MATH]
   ,
   [MATH:
   <mrow><msubsup><mi>λ</mi><mrow><mi>s</mi><mrow><mo>(</mo><mi>n</mi><mo>
   )</mo></mrow><mo>,</mo><msup><mi>v</mi><mrow><mo>(</mo><mi>m</mi><mo>)<
   /mo></mrow></msup></mrow><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   subsup></mrow> :MATH]
   is the batch-dependent and feature-specific scalar effect, where
   [MATH:
   <mrow><mi>s</mi><mrow><mo>(</mo><mi>n</mi><mo>)</mo></mrow></mrow>
   :MATH]
   indicates the batch index for the
   [MATH:
   <mrow><msup><mi>n</mi><mrow><mi>t</mi><mi>h</mi></mrow></msup></mrow>
   :MATH]
   cell. Notably, the softmax function normalizes the expected observation
   rates over all features separately within each modality to account for
   different modality size (e.g., there are more peaks than genes, and
   more transcripts than surface proteins). Another reason for the
   normalization is to capture feature sparsity (i.e., only a small
   fraction of features from each modality is non-zero). This is analogous
   to text mining, where only a small fraction of the unique words are
   draw from the entire vocabulary for any given document.

   The likelihood for cell n can be expressed as multinomial distribution:
   [MATH: <mrow><mi mathvariant="normal">p</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mi mathvariant="normal">w</mi><mi
   mathvariant="normal">n</mi></msub><mo>|</mo><msub><mi
   mathvariant="normal">r</mi><mi
   mathvariant="normal">n</mi></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo
   linebreak="goodbreak">∝</mo><munderover><mo>∏</mo><mrow><mi
   mathvariant="normal">i</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><msubsup><mi
   mathvariant="normal">D</mi><mi
   mathvariant="normal">n</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup></munderover><mro
   w><munderover><mo>∏</mo><mrow><msup><mi
   mathvariant="normal">v</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><msup><mi
   mathvariant="normal">V</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup></munderover><mrow><
   msup><mrow><mo stretchy="true">[</mo><msubsup><mi
   mathvariant="normal">r</mi><mrow><mi
   mathvariant="normal">n</mi><mo>,</mo><mi
   mathvariant="normal">v</mi></mrow><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">]</mo></mrow><mrow><mo
   stretchy="true">[</mo><mrow><msubsup><mi
   mathvariant="normal">w</mi><mrow><mi
   mathvariant="normal">n</mi><mo>,</mo><msup><mi
   mathvariant="normal">i</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup></mrow><mrow><mo>(</
   mo><mi mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup><mo
   linebreak="badbreak">=</mo><msup><mi
   mathvariant="normal">v</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup></mrow><mo
   stretchy="true">]</mo></mrow></msup><mo
   linebreak="badbreak">=</mo></mrow></mrow><mrow><munderover><mo>∏</mo><m
   row><msup><mi mathvariant="normal">v</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><msup><mi
   mathvariant="normal">V</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup></munderover><mrow><
   msup><mrow><mo stretchy="true">[</mo><msubsup><mi
   mathvariant="normal">r</mi><mrow><mi
   mathvariant="normal">n</mi><mo>,</mo><mi
   mathvariant="normal">v</mi></mrow><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">]</mo></mrow><mrow><msubsup><mo>∑</mo><mrow><mi
   mathvariant="normal">i</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><msubsup><mi
   mathvariant="normal">D</mi><mi
   mathvariant="normal">n</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup></msubsup><mrow><
   mo stretchy="true">[</mo><mrow><msubsup><mi
   mathvariant="normal">w</mi><mrow><mi
   mathvariant="normal">n</mi><mo>,</mo><msup><mi
   mathvariant="normal">i</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup></mrow><mrow><mo>(</
   mo><mi mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup><mo
   linebreak="badbreak">=</mo><msup><mi
   mathvariant="normal">v</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup></mrow><mo
   stretchy="true">]</mo></mrow></mrow></msup><mo
   linebreak="badbreak">=</mo></mrow></mrow><mrow><munderover><mo>∏</mo><m
   row><msup><mi mathvariant="normal">v</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><msup><mi
   mathvariant="normal">V</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup></munderover><msup><
   mrow><mo stretchy="true">[</mo><msubsup><mi
   mathvariant="normal">r</mi><mrow><mi
   mathvariant="normal">n</mi><mo>,</mo><mi
   mathvariant="normal">v</mi></mrow><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">]</mo></mrow><msubsup><mi
   mathvariant="normal">x</mi><mrow><mi
   mathvariant="normal">n</mi><mo>,</mo><mi
   mathvariant="normal">v</mi></mrow><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup></msup></mrow></m
   row> :MATH]

   where
   [MATH:
   <mrow><msubsup><mi>x</mi><mrow><mi>n</mi><mo>,</mo><mi>v</mi></mrow><mr
   ow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup><mo
   linebreak="goodbreak"
   linebreakstyle="after">=</mo><msubsup><mo>∑</mo><mrow><mi>i</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><msubsup><mi>D</mi><mi>n</m
   i><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup></msubsup><mrow>
   <mo
   stretchy="true">[</mo><mrow><msubsup><mi>w</mi><mrow><mi>n</mi><mo>,</m
   o><mi>i</mi></mrow><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup
   ><mo linebreak="badbreak">=</mo><mi>v</mi></mrow><mo
   stretchy="true">]</mo></mrow></mrow> :MATH]
   denotes the read count for feature
   [MATH:
   <mrow><msup><mi>v</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup
   ></mrow> :MATH]
   for cell n in the
   [MATH:
   <mrow><msup><mi>m</mi><mrow><mi>t</mi><mi>h</mi></mrow></msup></mrow>
   :MATH]
   modality. As a result, we can more conveniently express the data
   likelihood in terms of the read count:
   [MATH: <mrow><mi>p</mi><mrow><mo stretchy="true">(</mo><mrow><msub><mi
   mathvariant="bold">x</mi><mi>n</mi></msub><mo>|</mo><msub><mi
   mathvariant="bold">r</mi><mi>n</mi></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">=</mo><mrow><munderover><mo>∏</mo><mrow><msup><mi>
   v</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msup><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><msup><mi>V</mi><mrow><mo>(
   </mo><mi>m</mi><mo>)</mo></mrow></msup></munderover><msup><mrow><mo
   stretchy="true">[</mo><msubsup><mi>r</mi><mrow><mi>n</mi><mo>,</mo><mi>
   v</mi></mrow><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">]</mo></mrow><msubsup><mi>x</mi><mrow><mi>n</mi><mo>,</
   mo><mi>v</mi></mrow><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsu
   p></msup></mrow></mrow> :MATH]
   (Equation 4)

moETM model inference

   For the ease of inference, we consider the cell topic embedding
   [MATH: <mrow><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub></mrow> :MATH]
   (before softmax normalization) for cells
   [MATH: <mrow><mi>n</mi><mo linebreak="goodbreak"
   linebreakstyle="after">∈</mo><mrow><mo>{</mo><mrow><mn>1</mn><mo>,</mo>
   <mo>…</mo><mi>N</mi></mrow><mo>}</mo></mrow></mrow> :MATH]
   as the latent variables and all the cells are independent. The rest of
   the parameters including topic embedding
   [MATH: <mrow><mi mathvariant="bold">α</mi></mrow> :MATH]
   , feature embedding
   [MATH: <mrow><msubsup><mrow><mo stretchy="true">{</mo><msup><mi
   mathvariant="bold">ρ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup></mrow>
   :MATH]
   , and batch-effect parameter
   [MATH: <mrow><msubsup><mrow><mo stretchy="true">{</mo><msup><mi
   mathvariant="bold">λ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup></mrow>
   :MATH]
   are treated as point estimates and learned by the model. Let’s denote
   [MATH: <mrow><mover accent="true"><mi
   mathvariant="normal">Θ</mi><mo>ˆ</mo></mover><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mrow><mo
   stretchy="true">{</mo><mrow><mrow><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo>,</mo><mi
   mathvariant="bold">α</mi><mo>,</mo><msubsup><mrow><mo
   stretchy="true">{</mo><msup><mi
   mathvariant="bold">ρ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><mo>,</
   mo><msubsup><mrow><mo stretchy="true">{</mo><msup><mi
   mathvariant="bold">λ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup></mrow>
   <mo stretchy="true">)</mo></mrow><mo
   stretchy="true">}</mo></mrow></mrow> :MATH]
   . A principled way to learn those parameters is to maximize the log
   marginal likelihood:
   [MATH: <mrow><msup><mi mathvariant="normal">Θ</mi><mo>∗</mo></msup><mo
   linebreak="goodbreak">←</mo><munder><mi>argmax</mi><mi
   mathvariant="normal">Θ</mi></munder><mrow><munder><mo>∑</mo><mi
   mathvariant="normal">n</mi></munder><mrow><mi>log</mi><mspace
   width="0.25em"></mspace><mi mathvariant="normal">p</mi><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi mathvariant="normal">x</mi><mi
   mathvariant="normal">n</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">}</mo></mrow><mrow><mi mathvariant="normal">m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi
   mathvariant="normal">M</mi></msubsup><mo>∣</mo><msup><mi
   mathvariant="normal">Θ</mi><mo>∗</mo></msup></mrow><mo
   stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">≡</mo></mrow></mrow><munder><mi>argmax</mi><mi
   mathvariant="normal">Θ</mi></munder><mrow><munder><mo>∑</mo><mi
   mathvariant="normal">n</mi></munder><msub><mi
   mathvariant="script">L</mi><mi
   mathvariant="normal">n</mi></msub></mrow></mrow> :MATH]

   However, this integral is not tractable. Instead, we took a variational
   inference approach to optimize the model parameters by maximizing an
   evidence lower bound (ELBO) of the marginal log likelihood with a
   proposed variational posterior
   [MATH: <mrow><mi>q</mi><mrow><mo stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   as a surrogate to the true posterior of the cell topic embedding
   [MATH: <mrow><mi>p</mi><mrow><mo stretchy="true">(</mo><mrow><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo>∣</mo><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup></mrow>
   <mo stretchy="true">)</mo></mrow></mrow> :MATH]
   :
   [MATH: <mrow><msub><mi mathvariant="script">L</mi><mi>n</mi></msub><mo
   linebreak="badbreak">=</mo><mi>log</mi><mo>∫</mo><mi>p</mi><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><mo>,</
   mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo>∣</mo><msup><mi
   mathvariant="normal">Θ</mi><mo>∗</mo></msup></mrow><mo
   stretchy="true">)</mo></mrow><mi>d</mi><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub></mrow> :MATH]
   [MATH: <mrow><mo>=</mo><mi>log</mi><mo>∫</mo><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow><mfrac><mrow><mi>p</mi><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><mo>,</
   mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo>∣</mo><msup><mi
   mathvariant="normal">Θ</mi><mo>∗</mo></msup></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></mfrac><mi>d</mi><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub></mrow> :MATH]
   [MATH: <mrow><mo>=</mo><mi>log</mi><mspace
   width="0.25em"></mspace><msub><mi
   mathvariant="double-struck">E</mi><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></msub><mrow><mo>[</mo><mfrac><mrow
   ><mi>p</mi><mrow><mo stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><mo>,</
   mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo>∣</mo><msup><mi
   mathvariant="normal">Θ</mi><mo>∗</mo></msup></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></mfrac><mo
   stretchy="true">]</mo></mrow></mrow> :MATH]
   [MATH: <mrow><mo>≥</mo><msub><mi
   mathvariant="double-struck">E</mi><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></msub><mrow><mo
   stretchy="true">[</mo><mrow><mi>log</mi><mfrac><mrow><mi>p</mi><mrow><m
   o stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><mo>,</
   mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo>∣</mo><msup><mi
   mathvariant="normal">Θ</mi><mo>∗</mo></msup></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></mfrac></mrow><mo
   stretchy="true">]</mo></mrow></mrow> :MATH]
   (Equation 5)
   [MATH: <mrow><mo>=</mo><msub><mi
   mathvariant="double-struck">E</mi><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></msub><mrow><mo
   stretchy="true">[</mo><mrow><mi>log</mi><mspace
   width="0.25em"></mspace><mi>p</mi><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><mo>∣</
   mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo>,</mo><msup><mi
   mathvariant="normal">Θ</mi><mo>∗</mo></msup></mrow><mo
   stretchy="true">)</mo></mrow><mi>p</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">−</mo><mi>log</mi><mspace
   width="0.25em"></mspace><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow></mrow> :MATH]
   [MATH: <mrow><mo>=</mo><msub><mi
   mathvariant="double-struck">E</mi><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></msub><mrow><mo
   stretchy="true">[</mo><mrow><mi>log</mi><mspace
   width="0.25em"></mspace><mi>p</mi><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><mo>∣</
   mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo>,</mo><msup><mi
   mathvariant="normal">Θ</mi><mo>∗</mo></msup></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow><mo linebreak="goodbreak">−</mo><msub><mi
   mathvariant="double-struck">E</mi><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></msub><mrow><mo
   stretchy="true">[</mo><mrow><mi>log</mi><mfrac><mrow><mi>q</mi><mrow><m
   o stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mrow><mi>p</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></mfrac></mrow><mo
   stretchy="true">]</mo></mrow></mrow> :MATH]
   [MATH: <mrow><mo>=</mo><msub><mi
   mathvariant="double-struck">E</mi><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></msub><mrow><mo
   stretchy="true">[</mo><mrow><mi>log</mi><mspace
   width="0.25em"></mspace><mi>p</mi><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><mo>∣</
   mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo>,</mo><msup><mi
   mathvariant="normal">Θ</mi><mo>∗</mo></msup></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow><mo
   linebreak="badbreak">−</mo><mi>K</mi><mi>L</mi><mrow><mo
   stretchy="true">[</mo><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow><mo>‖</mo><mi>p</mi><mrow><mo
   stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow><mo
   linebreak="goodbreak">≡</mo><mi>E</mi><mi>L</mi><mi>B</mi><msub><mi>O</
   mi><mi>n</mi></msub></mrow> :MATH]
   (Equation 6)

   where [279]Equation 5 follows the Jensen’s inequality[280]^69 and KL
   denotes the Kullback-Leibler (KL) divergence between the proposed
   distribution and the prior (i.e., standard Gaussian with zero mean and
   identity variance), acting as a regularization when maximizing the data
   likelihood.

   We defined the proposed distribution
   [MATH: <mrow><mi>q</mi><mrow><mo stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   as a product of Gaussians (PoGs):
   [MATH: <mrow><mi>q</mi><mrow><mo stretchy="true">(</mo><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow><mo linebreak="badbreak">=</mo><mi
   mathvariant="script">N</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mi
   mathvariant="bold">δ</mi><mi>n</mi></msub><mo>;</mo><msup><mi
   mathvariant="bold">μ</mi><mo>∗</mo></msup><mo>,</mo><msup><mi
   mathvariant="bold">σ</mi><mrow><mo>∗</mo><mn>2</mn></mrow></msup></mrow
   ><mo stretchy="true">)</mo></mrow><mtext>,</mtext></mrow> :MATH]
   (Equation 7)

   The mean
   [MATH: <mrow><mi mathvariant="bold">μ</mi></mrow> :MATH]
   and standard deviation
   [MATH: <mrow><mi mathvariant="bold">σ</mi></mrow> :MATH]
   of the joint Gaussian is computed as:
   [MATH: <mrow><msup><mi mathvariant="bold">μ</mi><mo>∗</mo></msup><mo
   linebreak="badbreak">=</mo><mfrac><mrow><msubsup><mo>∑</mo><mrow><mi>m<
   /mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><msub><
   mi mathvariant="bold">μ</mi><mi>m</mi></msub><msubsup><mi
   mathvariant="bold">σ</mi><mi>m</mi><mn>2</mn></msubsup></mrow><mrow><mn
   >1</mn><mo
   linebreak="badbreak">+</mo><msubsup><mo>∑</mo><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><msubsu
   p><mi
   mathvariant="bold">σ</mi><mi>m</mi><mn>2</mn></msubsup></mrow></mfrac><
   mo>,</mo><msup><mi
   mathvariant="bold">σ</mi><mrow><mo>∗</mo><mn>2</mn></mrow></msup><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><msubsup><mo>∏</mo><mrow><mi>m
   </mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><msubsu
   p><mi
   mathvariant="bold">σ</mi><mi>m</mi><mn>2</mn></msubsup></mrow><mrow><mn
   >1</mn><mo
   linebreak="badbreak">+</mo><msubsup><mo>∑</mo><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><msubsu
   p><mi
   mathvariant="bold">σ</mi><mi>m</mi><mn>2</mn></msubsup></mrow></mfrac><
   /mrow> :MATH]
   (Equation 8)

   where
   [MATH: <mrow><msub><mi
   mathvariant="bold">μ</mi><mi>m</mi></msub></mrow> :MATH]
   and
   [MATH: <mrow><msubsup><mi
   mathvariant="bold">σ</mi><mi>m</mi><mn>2</mn></msubsup></mrow> :MATH]
   are the mean and variance of the Gaussian latent embedding for the
   individual modalities, respectively. Those are output from the encoder
   neural network (NNET):
   [MATH: <mrow><mrow><mo stretchy="true">[</mo><mrow><msubsup><mi
   mathvariant="bold">μ</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo>;</mo><mi>log</mi><mspace
   width="0.25em"></mspace><msubsup><mi
   mathvariant="bold">σ</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup></mrow><mo stretchy="true">]</mo></mrow><mo
   linebreak="badbreak">=</mo><mtext>NNET</mtext><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mover accent="true"><mi
   mathvariant="bold">x</mi><mo>˜</mo></mover><mi>n</mi><mrow><mo>(</mo><m
   i>m</mi><mo>)</mo></mrow></msubsup><mo>;</mo><mi
   mathvariant="bold">W</mi></mrow><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   (Equation 9)

   where
   [MATH: <mrow><msubsup><mover accent="true"><mi
   mathvariant="bold">x</mi><mo>˜</mo></mover><mi>n</mi><mrow><mo>(</mo><m
   i>m</mi><mo>)</mo></mrow></msubsup></mrow> :MATH]
   is the normalized counts for each feature as the raw count of the
   feature divided by the total counts of
   [MATH:
   <mrow><msup><mi>m</mi><mrow><mi>t</mi><mi>h</mi></mrow></msup></mrow>
   :MATH]
   modality in cell n, and
   [MATH: <mrow><mi mathvariant="bold">W</mi></mrow> :MATH]
   is the parameters for a two-layer feedforward neural network.

   We approximate the above ELBO in [281]Equation 6 by sampling from the
   proposed joint Gaussian distribution using the reparameterization
   trick[282]^13:
   [MATH: <mrow><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   linebreak="badbreak">∼</mo><mi mathvariant="script">N</mi><mrow><mo
   stretchy="true">(</mo><mrow><msup><mi
   mathvariant="bold">μ</mi><mo>∗</mo></msup><mo>,</mo><mtext>diag</mtext>
   <mrow><mo stretchy="true">(</mo><msup><mi
   mathvariant="bold">σ</mi><mo>∗</mo></msup><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">)</mo></mrow><mo linebreak="badbreak">=</mo><msup><mi
   mathvariant="bold">μ</mi><mo>∗</mo></msup><mo
   linebreak="goodbreak">+</mo><mtext>diag</mtext><mrow><mo
   stretchy="true">(</mo><msup><mi
   mathvariant="bold">σ</mi><mo>∗</mo></msup><mo
   stretchy="true">)</mo></mrow><mi
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><mn>0</mn><mo>,</mo><m
   i mathvariant="bold">I</mi></mrow><mo>)</mo></mrow></mrow> :MATH]
   [MATH:
   <mrow><mi>E</mi><mi>L</mi><mi>B</mi><msubsup><mi>O</mi><mi>n</mi><mrow>
   <mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup><mo
   linebreak="badbreak">≈</mo><mi>log</mi><mspace
   width="0.25em"></mspace><mi>p</mi><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><mo>∣</
   mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo>,</mo><
   mi mathvariant="normal">Θ</mi></mrow><mo
   stretchy="true">)</mo></mrow><mo
   linebreak="goodbreak">−</mo><mi>K</mi><mi>L</mi><mrow><mo
   stretchy="true">[</mo><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow><mo>‖</mo><mi>p</mi><mrow><mo
   stretchy="true">(</mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow></mrow> :MATH]

   where the KL divergence has closed form:
   [MATH: <mrow><mrow><mi>K</mi><mi>L</mi><mrow><mo
   stretchy="true">[</mo><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow><mo>‖</mo><mi>p</mi><mrow><mo
   stretchy="true">(</mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow><mo linebreak="badbreak">=</mo><msub><mi
   mathvariant="double-struck">E</mi><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></msub><mrow><mo
   stretchy="true">[</mo><mrow><mi>log</mi><mspace
   width="0.25em"></mspace><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow><mo linebreak="badbreak">−</mo><msub><mi
   mathvariant="double-struck">E</mi><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow></msub><mrow><mo
   stretchy="true">[</mo><mrow><mi>log</mi><mspace
   width="0.25em"></mspace><mi>p</mi><mrow><mo
   stretchy="true">(</mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow></mrow><mrow><mo
   linebreak="newline">=</mo><mrow><mo
   stretchy="true">[</mo><mrow><mo>−</mo><mfrac><mn>1</mn><mn>2</mn></mfra
   c><mi>log</mi><mspace width="0.25em"></mspace><msup><mrow><mo
   stretchy="true">{</mo><msup><mi
   mathvariant="bold">σ</mi><mo>∗</mo></msup><mo
   stretchy="true">}</mo></mrow><mn>2</mn></msup><mo
   linebreak="badbreak">−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><mo
   linebreak="badbreak">−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><mi>log</
   mi><mrow><mo>(</mo><mrow><mn>2</mn><mi>π</mi></mrow><mo>)</mo></mrow></
   mrow><mo stretchy="true">]</mo></mrow><mo
   linebreak="badbreak">−</mo><mrow><mo
   stretchy="true">[</mo><mrow><mo>−</mo><mfrac><mn>1</mn><mn>2</mn></mfra
   c><mi>log</mi><mrow><mo>(</mo><mrow><mn>2</mn><mi>π</mi></mrow><mo>)</m
   o></mrow><mo
   linebreak="badbreak">−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mr
   ow><mo stretchy="true">{</mo><msup><mi
   mathvariant="bold">μ</mi><mo>∗</mo></msup><mo
   stretchy="true">}</mo></mrow><mn>2</mn></msup><mo
   linebreak="badbreak">−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mr
   ow><mo stretchy="true">{</mo><msup><mi
   mathvariant="bold">σ</mi><mo>∗</mo></msup><mo
   stretchy="true">}</mo></mrow><mn>2</mn></msup></mrow><mo
   stretchy="true">]</mo></mrow></mrow><mrow><mo
   linebreak="newline">=</mo><mo
   linebreak="badbreak">−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><mrow><mo
   stretchy="true">[</mo><mrow><mi>log</mi><mspace
   width="0.25em"></mspace><msup><mrow><mo stretchy="true">{</mo><msup><mi
   mathvariant="bold">σ</mi><mo>∗</mo></msup><mo
   stretchy="true">}</mo></mrow><mn>2</mn></msup><mo
   linebreak="badbreak">−</mo><msup><mrow><mo
   stretchy="true">{</mo><msup><mi
   mathvariant="bold">μ</mi><mo>∗</mo></msup><mo
   stretchy="true">}</mo></mrow><mn>2</mn></msup><mo
   linebreak="badbreak">−</mo><msup><mrow><mo
   stretchy="true">{</mo><msup><mi
   mathvariant="bold">σ</mi><mo>∗</mo></msup><mo
   stretchy="true">}</mo></mrow><mn>2</mn></msup><mo
   linebreak="badbreak">+</mo><mn>1</mn></mrow><mo
   stretchy="true">]</mo></mrow></mrow></mrow> :MATH]
   (Equation 10)

   Together, with the Multinomial likelihood defined in [283]4 and KL
   divergence in [284]10, we can express the ELBO in its approximate
   closed-form using the sampled latent variable:
   [MATH: <mrow><mrow><mi mathvariant="normal">E</mi><mi
   mathvariant="normal">L</mi><mi mathvariant="normal">B</mi><mi
   mathvariant="normal">O</mi><mo
   linebreak="badbreak">≈</mo><munderover><mo>∑</mo><mrow><mi
   mathvariant="normal">n</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi
   mathvariant="normal">N</mi></munderover><mrow><munderover><mo>∑</mo><mr
   ow><mi mathvariant="normal">m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi
   mathvariant="normal">M</mi></munderover><mrow><mi>log</mi><mspace
   width="0.25em"></mspace><mi mathvariant="normal">p</mi><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi mathvariant="normal">x</mi><mi
   mathvariant="normal">n</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">}</mo></mrow><mrow><mi mathvariant="normal">m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi
   mathvariant="normal">M</mi></msubsup><mo>∣</mo><msub><mover
   accent="true"><mi>δ</mi><mo>˜</mo></mover><mi
   mathvariant="normal">n</mi></msub><mo>,</mo><mi
   mathvariant="normal">Θ</mi></mrow><mo stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">−</mo><mi mathvariant="normal">K</mi><mi
   mathvariant="normal">L</mi><mrow><mo stretchy="true">[</mo><mrow><mi
   mathvariant="normal">q</mi><mrow><mo stretchy="true">(</mo><msub><mover
   accent="true"><mi>δ</mi><mo>˜</mo></mover><mi
   mathvariant="normal">n</mi></msub><mo
   stretchy="true">)</mo></mrow><mo>‖</mo><mi
   mathvariant="normal">p</mi><mrow><mo stretchy="true">(</mo><msub><mover
   accent="true"><mi>δ</mi><mo>˜</mo></mover><mi
   mathvariant="normal">n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow></mrow></mrow></mrow><mrow><mo
   linebreak="newline">=</mo><munderover><mo>∑</mo><mrow><mi
   mathvariant="normal">m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi
   mathvariant="normal">M</mi></munderover><mrow><munderover><mo>∑</mo><mr
   ow><msup><mi mathvariant="normal">v</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><msup><mi
   mathvariant="normal">V</mi><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msup></munderover><mrow><
   msubsup><mi mathvariant="normal">x</mi><mrow><mi
   mathvariant="normal">n</mi><mo>,</mo><mi
   mathvariant="normal">v</mi></mrow><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup><mspace
   width="0.25em"></mspace><mi>log</mi><mrow><mo
   stretchy="true">(</mo><msubsup><mi mathvariant="normal">r</mi><mrow><mi
   mathvariant="normal">n</mi><mo>,</mo><mi
   mathvariant="normal">v</mi></mrow><mrow><mo>(</mo><mi
   mathvariant="normal">m</mi><mo>)</mo></mrow></msubsup><mo
   stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><mrow><mo
   stretchy="true">[</mo><mrow><mi>log</mi><mspace
   width="0.25em"></mspace><msup><mrow><mo
   stretchy="true">{</mo><msup><mi>σ</mi><mo>∗</mo></msup><mo
   stretchy="true">}</mo></mrow><mn>2</mn></msup><mo
   linebreak="badbreak">−</mo><msup><mrow><mo
   stretchy="true">{</mo><msup><mi>μ</mi><mo>∗</mo></msup><mo
   stretchy="true">}</mo></mrow><mn>2</mn></msup><mo
   linebreak="badbreak">−</mo><msup><mrow><mo
   stretchy="true">{</mo><msup><mi>σ</mi><mo>∗</mo></msup><mo
   stretchy="true">}</mo></mrow><mn>2</mn></msup><mo
   linebreak="badbreak">+</mo><mn>1</mn></mrow><mo
   stretchy="true">]</mo></mrow></mrow></mrow></mrow></mrow> :MATH]
   (Equation 11)

   where
   [MATH:
   <mrow><msubsup><mi>r</mi><mrow><mi>n</mi><mo>,</mo><mi>v</mi></mrow><mr
   ow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></msubsup></mrow> :MATH]
   is defined in [285]3. The model parameters including the encoder weight
   [MATH: <mrow><mi mathvariant="bold">W</mi></mrow> :MATH]
   and the decoder weights
   [MATH: <mrow><mi mathvariant="normal">Θ</mi><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mrow><mo>{</mo><mrow><mi
   mathvariant="bold">α</mi><mo>,</mo><mi
   mathvariant="bold">ρ</mi></mrow><mo>}</mo></mrow></mrow> :MATH]
   are optimized by maximizing the above ELBO via backpropagation:
   [MATH: <mrow><msup><mi
   mathvariant="normal">Θ</mi><mo>∗</mo></msup><mo>,</mo><msup><mi
   mathvariant="bold">W</mi><mo>∗</mo></msup><mo
   linebreak="goodbreak">←</mo><munder><mi>argmax</mi><mrow><mi
   mathvariant="normal">Θ</mi><mo>,</mo><mi
   mathvariant="bold">W</mi></mrow></munder><mrow><munderover><mo>∑</mo><m
   row><mi>n</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mro
   w><mi>log</mi><mspace width="0.25em"></mspace><mi>p</mi><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msubsup><mi
   mathvariant="bold">x</mi><mi>n</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo stretchy="true">}</mo></mrow><mrow><mi>m</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>M</mi></msubsup><mo>∣</
   mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo>,</mo><
   mi mathvariant="normal">Θ</mi></mrow><mo
   stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">−</mo><mi>K</mi><mi>L</mi><mrow><mo
   stretchy="true">[</mo><mrow><mi>q</mi><mrow><mo
   stretchy="true">(</mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow><mo>‖</mo><mi>p</mi><mrow><mo
   stretchy="true">(</mo><msub><mover accent="true"><mi
   mathvariant="bold">δ</mi><mo>˜</mo></mover><mi>n</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow></mrow></mrow></mrow> :MATH]
   (Equation 12)

Single-cell multi-omic datasets and preprocessing

   There were 7 public datasets included in this study for performance
   evaluation and model comparison. All 7 datasets are from publicly
   available repositories. Among them, 4 datasets provide joint profiling
   of gene expression and open chromatin regions (denoted as “gene+peak”
   data).
     * 1.
       Multiome bone marrow mononuclear cells (BMMC1) dataset from the
       2021 NeurIPS challenge consisting of 42,492 cells with 22 cell
       types from 10 donors across 4 sites,[286]^70,
     * 2.
       SHARE-seq mouse skin late anagen (MSLAC) dataset containing 34,774
       cells with 1 batch and 23 cell types,[287]^24,
     * 3.
       sci-CAR mouse kidney cells (MKC) dataset from cell samples with 1
       batch and 14 cell types,[288]^71,
     * 4.
       SHARE-seq mouse brain cells (MBC) dataset containing 3,293 cells
       with 1 batch and 19 cell types.[289]^24

   For the BMMC1 dataset, we take into account two different batch types:
   one treats a subject (e.g., site1 + donor1 as a subject s1d1, site1 +
   donor2 as a subject s1d2, etc) as a batch (s1d1, s1d2, s1d3, s2d1,
   s2d4, s2d5, s3d3, s3d6, s3d7, s3d10, s4d1, s4d8, s4d9, 13 batches in
   total), while the other treats a site (site1 as batch1, site2 as
   batch2) as a batch (4 batches in total). For the CITE-seq data
   measuring transcriptome and surface protein in the same cell, 3
   datasets were used in this study.
     * 1.
       Bone marrow mononuclear cells (BMMC2) dataset from the 2021 NeurIPS
       challenge from 9 donors and 4 sites,[290]^70,
     * 2.
       Human White Blood Cell (HWBC) dataset containing 211,000 human
       peripheral blood mononuclear cells,[291]^12,
     * 3.
       Human Blood Immune Cell (HBIC) dataset[292]^57 measuring 647,366
       peripheral blood mononuclear cells from both COVID patients and
       healthy patients.

   Similarly, for the BMMC2 dataset, we consider two different batch
   types: one treats donors as batches (12 batches in total), while the
   other treats sites as batches (4 batches in total). All datasets were
   processed into the format of samples-by-features matrices. For
   gene+peak datasets, the read count for each gene and peak were first
   normalized per cell by total counts within the same omic using
   scanpy.pp.normalize_total function in the scanpy,[293]^72 then log1p
   transformation was applied. After that, scanpy.pp.highly_variable_genes
   was used to select highly variable genes or peaks. For the joint
   profiling of transcriptome and surface protein data (denoted as
   gene+protein), we used all surface proteins measured by the scADT-seq
   assay since the number of proteins is much smaller compared with the
   number of genes or peaks and all of them are highly informative of
   immune cell functions. The same normalization as in the gene+peak data
   was performed on the gene+protein data.

Cross-omic imputation

   The trained moETM can impute one omic from another omic. Suppose we
   have two omics namely omic A and omic B. For the training data where
   both omics are observed, moETM learns a shared topic embedding
   [MATH: <mrow><mi mathvariant="bold">α</mi></mrow> :MATH]
   and omic-specific feature embedding
   [MATH: <mrow><msup><mi
   mathvariant="bold">ρ</mi><mrow><mo>(</mo><mi>A</mi><mo>)</mo></mrow></m
   sup></mrow> :MATH]
   and
   [MATH: <mrow><msup><mi
   mathvariant="bold">ρ</mi><mrow><mo>(</mo><mi>B</mi><mo>)</mo></mrow></m
   sup></mrow> :MATH]
   . For the testing data, suppose without loss of generality that only
   omic B is observed. To impute omic A, moETM uses the encoder for
   modality B to generate the topic mixture, which is then input to the
   decoder for omic A to complete the imputation ([294]Figure 1C). We
   evaluated the imputation accuracy using the BMMC1 (gene+peak) and BMMC2
   (gene+protein) datasets based on (1) 60/40 random split of training and
   testing data with 500 repeats to get standard deviation estimate; (2)
   training on all batches except for one batch and testing on the
   held-out batch (leave-one-batch); (3) training on all cell types except
   for one cell type and testing on the held-out cells of that cell type
   (leave-one-cell-type).

Evaluation metrics

   The batch effects correction and biological variance conservation
   categories were used to assess the efficacy of the integration across
   multiple modalities. To quantify bio-conservation, we used the Adjusted
   Rand Index (ARI) and Normalized Mutual Information (NMI), and to
   measure batch effect removal, we used k-nearest-neighbor batch-effect
   test (kBET) and Graph Connectivity (GC). Specifically, ARI calculates
   the degree of similarity between two clusterings and adjusts for the
   possibility that objects can randomly form the same clusters. NMI
   normalizes the mutual information to a scale of 0–1. While NMI excels
   in unbalanced clustering or small clusters, ARI is better suited to
   clusters of similar size.[295]^73 kBET performs hypothesis testing on
   whether batch labels are distributed differently across cells based on
   Pearson’s
   [MATH: <mrow><msup><mi>χ</mi><mn>2</mn></msup></mrow> :MATH]
   test.[296]^19 GC measures whether cells of the same type from different
   batches are close to one another by computing a K nearest-neighbour
   graph based on the distance between cells in the embedding
   space.[297]^20

Quantification and statistical analysis

Linking genes to open chromatin regions

   We sought to investigate the relation between the top peaks and top
   genes under the same moETM topic (i.e.,
   [MATH: <mrow><msubsup><mi
   mathvariant="bold">φ</mi><mi>k</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo>
   </mrow></msubsup><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><msub><mi
   mathvariant="bold">α</mi><mi>k</mi></msub><msup><mi
   mathvariant="bold">ρ</mi><mrow><mo>(</mo><mi>m</mi><mo>)</mo></mrow></m
   sup></mrow> :MATH]
   for topic k and
   [MATH: <mrow><mi>m</mi><mo linebreak="goodbreak"
   linebreakstyle="after">∈</mo><mrow><mo>{</mo><mrow><mi>g</mi><mi>e</mi>
   <mi>n</mi><mi>e</mi><mo>,</mo><mi>p</mi><mi>e</mi><mi>a</mi><mi>k</mi><
   /mrow><mo>}</mo></mrow></mrow> :MATH]
   ). To assess the in-cis relation, we measured the genomic distances
   between genes and peaks and designated genes that were near peaks as
   peak-neighboring-genes if they are within 150K base pairs (bp)
   distance. Specifically, we first obtained a genes-by-topics matrix
   [MATH: <mrow><msup><mi
   mathvariant="bold">φ</mi><mrow><mo>(</mo><mtext>gene</mtext><mo>)</mo><
   /mrow></msup><mo linebreak="goodbreak" linebreakstyle="after">=</mo><mi
   mathvariant="bold">α</mi><msup><mi
   mathvariant="bold">ρ</mi><mrow><mo>(</mo><mtext>gene</mtext><mo>)</mo><
   /mrow></msup></mrow> :MATH]
   and a peaks-by-topics matrix
   [MATH: <mrow><msup><mi
   mathvariant="bold">φ</mi><mrow><mo>(</mo><mtext>peak</mtext><mo>)</mo><
   /mrow></msup><mo linebreak="goodbreak" linebreakstyle="after">=</mo><mi
   mathvariant="bold">α</mi><msup><mi
   mathvariant="bold">ρ</mi><mrow><mo>(</mo><mtext>peak</mtext><mo>)</mo><
   /mrow></msup></mrow> :MATH]
   .

   To transform
   [MATH: <mrow><msup><mi
   mathvariant="bold">φ</mi><mrow><mo>(</mo><mtext>peak</mtext><mo>)</mo><
   /mrow></msup></mrow> :MATH]
   into a peak_to_genes-by-topics matrix
   [MATH: <mrow><msup><mi
   mathvariant="bold">φ</mi><mrow><mo>(</mo><mrow><mtext>peaks</mtext><mo>
   _</mo><mtext>to</mtext><mo>_</mo><mtext>genes</mtext></mrow><mo>)</mo><
   /mrow></msup></mrow> :MATH]
   , we first derived a binary peaks-to-genes mapping matrix
   [MATH: <mrow><mi mathvariant="bold">H</mi></mrow> :MATH]
   with the entries
   [MATH:
   <mrow><msub><mi>h</mi><mrow><mi>p</mi><mo>,</mo><mi>g</mi></mrow></msub
   ><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>1</mn></mrow> :MATH]
   if the corresponding pair of peak p and gene g are within 150K bp
   genomic distance and are positively correlated and 0 otherwise. In
   detail, we computed the Pearson correlation between gene g and peak p
   in terms of their topic scores:
   [MATH:
   <mrow><msub><mi>r</mi><mrow><mi>p</mi><mo>,</mo><mi>g</mi></mrow></msub
   ><mo linebreak="badbreak">=</mo><mfrac><mrow><msup><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mi
   mathvariant="bold">φ</mi><mi>g</mi><mrow><mo>(</mo><mtext>gene</mtext><
   mo>)</mo></mrow></msubsup><mo
   linebreak="badbreak">−</mo><msubsup><mover accent="true"><mi
   mathvariant="bold">φ</mi><mo>¯</mo></mover><mi>g</mi><mrow><mo>(</mo><m
   text>gene</mtext><mo>)</mo></mrow></msubsup></mrow><mo
   stretchy="true">)</mo></mrow><mo>⊤</mo></msup><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mi
   mathvariant="bold">φ</mi><mi>p</mi><mrow><mo>(</mo><mtext>peak</mtext><
   mo>)</mo></mrow></msubsup><mo
   linebreak="badbreak">−</mo><msubsup><mover accent="true"><mi
   mathvariant="bold">φ</mi><mo>¯</mo></mover><mi>p</mi><mrow><mo>(</mo><m
   text>peak</mtext><mo>)</mo></mrow></msubsup></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mrow><mrow><mo>‖</mo><mrow><msubsu
   p><mi
   mathvariant="bold">φ</mi><mi>g</mi><mrow><mo>(</mo><mtext>gene</mtext><
   mo>)</mo></mrow></msubsup><mo
   linebreak="badbreak">−</mo><msubsup><mover accent="true"><mi
   mathvariant="bold">φ</mi><mo>¯</mo></mover><mi>g</mi><mrow><mo>(</mo><m
   text>gene</mtext><mo>)</mo></mrow></msubsup><msub><mo>‖</mo><mn>2</mn><
   /msub></mrow><mo>‖</mo></mrow><msubsup><mi
   mathvariant="bold">φ</mi><mi>p</mi><mrow><mo>(</mo><mtext>peak</mtext><
   mo>)</mo></mrow></msubsup><mo
   linebreak="badbreak">−</mo><msubsup><mover accent="true"><mi
   mathvariant="bold">φ</mi><mo>¯</mo></mover><mi>p</mi><mrow><mo>(</mo><m
   text>peak</mtext><mo>)</mo></mrow></msubsup><mo>)</mo><msub><mo>‖</mo><
   mn>2</mn></msub></mrow></mfrac></mrow> :MATH]

   The genome distance between peaks and genes was based on the latest
   genome build (i.e., hg38 for human) and obtained via the
   GenomicRanges[298]^74 package in R.

Pathway enrichment analysis

   For each moETM topic, we performed Gene Set Enrichment Analysis
   (GSEA)[299]^37 to associate the topic with known pathways or gene sets.
   In particular, we used each topic to query two gene sets from Molecular
   signatures database (MSigDB), which are the 5219 Immunologic signature
   gene sets (C7) and the 7763 Gene Ontology Biological Processes (BP)
   (C5-BP) terms. For each topic, we ran GSEAPreranked on a ranked list of
   genes based on their corresponding topic scores against every gene set
   from C7 or C5-BP, and calculated the enrichment score (ES) for over- or
   under-representation. The statistical significance of the ES was
   computed based on 1000 permutation test. The gene sets with
   Benjamini–Hochberg (BH) corrected p values lower than 0.05 were deemed
   significant. Similarly, for the scATAC-seq data, the peaks-by-topics
   matrix was first converted into a peaks_to_genes-by-topics matrix and
   then provide as input to GSEA pipeline.

Motif enrichment analysis of top peaks from moETM-learned topics

   To detect sequence-based regulatory elements for the cell-type-specific
   topics, we performed motif enrichment analysis using the top 100 peaks
   that exhibit the highest topic scores under each topic. The 100
   sequences corresponding to those top 100 peaks under each topic were
   extracted from Ensembl database and provided as input to the Simple
   Enrichment Analysis (SEA) pipeline[300]^75 from the MEME suite.[301]^76
   SEA utilizes the STREME motif discovery algorithm[302]^77 to identify
   known motifs that are enriched in input sequences. For our purpose, we
   used the HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) Human
   (v11) and HOCOMOCO Mouse (v11) motif database.[303]^78 Motifs with
   Fisher’s exact test p values lower than 0.05 were selected as the
   enriched motifs.

Differential analysis to detect condition-specific topics

   We sought to detect moETM-topics that exhibit significantly higher
   scores for the conditions of interest such as cell types or phenotypes.
   Notably, while the cell types were at the single-cell level, the
   phenotypes were at the subject level (e.g., COVID-19 severity state).
   The latter means that the cells from the same subject were assigned the
   same phenotype label. For each dataset, we first split the cells into
   positive and negative groups, corresponding to the presence and absence
   of the target condition, respectively. For each topic, we assessed the
   statistical significance of the topic score increase for the positive
   group relative to the negative group based on one-sided Student’s t
   test. The topics with a Bonferroni-adjusted p value smaller than 0.001
   were considered significant with the label.

Incorporating pathway-informed gene embeddings

   In the linear decoder, we reconstruct the cells-by-features matrix by
   the dot product of the 3 matrices, namely cells-by-topics,
   topics-by-embedding, and embedding-by-features. By default, the last
   feature embedding matrix consist of learnable parameters. However, we
   can instill prior pathway information during the training of moETM by
   fixing the features embedding to a known gene set. As a result, the
   topics-by-embedding and embedding-by-features matrices change to
   topics-by-gene_sets and gene_sets-by-features with only the
   topics-by-gene_sets as the learnable parameters. This allows us to
   directly map each topic to each gene set, which may further improve the
   model interpretability especially if the chosen gene sets were highly
   relevant to the data. Given that several single-cell multi-omic
   datasets used in this study were derived from the blood, we utilized
   the Immunologic signature gene sets collection (C7) from the MSigDB
   database. Gene sets with fewer than five or more than 1000 genes were
   filtered out. We then converted the gene set information into a binary
   gene_sets-by-genes matrix with 0 and 1 indicating the absence and
   presence of the genes (columns) in the corresponding gene set (rows),
   respectively. We focused on the gene+peak case by fixing the gene
   embedding to the gene set while learning the peak embedding as in the
   default setting. We did not experiment this approach on the
   gene+protein case, for which the topics learned by the default moETM
   are sufficiently easy to interpret.

Acknowledgments