Graphical abstract
graphic file with name fx1.jpg
[34]Open in a new tab
Highlights
* •
moETM integrates multiple modalities to a low-dimensional
representation
* •
moETM accurately achieves cross-omics imputation
* •
moETM-inferred topics can elucidate cell-type signatures and gene
regulatory programs
* •
moETM reveals the molecular basis of disease severity on a COVID-19
CITE-seq
Motivation
Single-cell technologies are advancing to enable profiling of multiple
modalities within individual cells. However, the sparsity and noise in
the data poses challenges for analysis. Most existing computational
methods for analyzing single-cell data are either limited to single
modality or lack flexibility and interpretability. In this study, we
introduce moETM, a unified deep learning model that integrates
single-cell multi-omics data by projecting them onto a common topic
mixture representation. Furthermore, moETM employs a linear decoder
design, which facilitates the interpretability and the discovery of
biologically significant patterns.
__________________________________________________________________
Zhou et al. develop moETM to integrate single-cell multi-omics data by
leveraging product-of-experts framework to infer latent topics and
linear decoders to learn multi-omics features. moETM aims to accomplish
several objectives, including clustering and identifying sub-cell
types, cross-omics imputation, and identifying cell-type signatures.
Introduction
Multi-omics single-cell high-throughput sequencing technologies open up
new opportunities to interrogate cell-type-specific gene regulatory
programs. Single-cell RNA sequencing (scRNA-seq) combined with assay
for transposase-accessible chromatin using sequencing (ATAC-seq)[35]^1
simultaneously measure the transcriptome and chromatin accessibility in
the same cell. Cellular indexing of transcriptomes and epitopes by
sequencing (CITE-seq)[36]^2 measures surface protein and transcriptome
data using oligonucleotide-labeled antibodies. By integrating the
information from these multiple omics, we can expand our understanding
of the genome regulation from multiple perspectives.
However, extracting meaningful biological patterns from the
fast-growing multi-omics single-cell data remains a challenge due to
several factors.[37]^3^,[38]^4 Firstly, the cell yield of multi-omics
single-cell technologies is lower compared with the single-omics
technologies such as scRNA-seq. On the other hand, the combined feature
dimension is much higher (e.g., genes and peaks). This requires a more
deliberate model design that can flexibly distill meaningful cell-type
signatures from the multi-modal data while not overfitting the data.
Secondly, multi-omics single-cell data are noisier compared with
bulk-level or single-cell single-omics data. This calls for a
probabilistic model that can infer latent cell types while properly
accounting for the statistical uncertainty. Thirdly, the batch effects
make it challenging to distinguish biological signals from
study-specific confounders. Finally, multi-omics single-cell data are
more costly compared with scRNA-seq or scATAC-seq alone. It is
therefore highly cost-effective if we can profile single-omics data and
then predict the unobserved omics data.
Recently, several computational methods were developed to tackle the
above multi-modality data-integration challenges encountered in
multi-omics single-cell data analysis. For instance, SMILE[39]^5
integrates multi-omics data by minimizing the mutual information of the
latent representations among the modalities and batches. The
totalVI[40]^6 and multiVI[41]^7 integrate CITE-seq data and scRNA +
scATAC data via variational autoencoder (VAE) frameworks, respectively.
Cobolt[42]^8 is a hierarchical Bayesian generative model to integrate
cell modalities. scMM[43]^9 is a mixture-of-experts (MoE) model
developed to impute one missing modality conditioned on the other.
Multigrate[44]^10 adopted a product-of-experts framework to integrate
multi-omics data. MOFA+[45]^11 uses mean-field variational Bayes and
coordinate ascent to fit a Bayesian group factor analysis model to
integrate the multi-omics data. Seurat V4[46]^12 integrated multimodal
single-cell data through the weighted nearest neighbor algorithm. While
many of these methods conferred promising performances in some of the
tasks such as cell clustering or modality imputation, they often need
to compromise scalability, interpretability, and/or flexibility. In
particular, when a neural network is used to encode the
high-dimensional multi-omics data, interpretability is traded for
flexibility; when a linear model or independent feature assumption is
made, flexibility is traded for interpretability and scalability.
However, all three are important to reveal cell-type-specific
multi-omics signatures that are indicative of gene regulatory programs
from large-scale data. Furthermore, most of these methods are entirely
data driven and therefore incapable of fully utilizing the existing
biological information such as gene annotations or pathway information.
In this study, we present a multi-omics embedded topic model (moETM) to
integrate multiple molecular modalities at the single-cell level. As
one of the main technical contributions, moETM uses product-of-experts
to infer latent topics underlying the single-cell multi-omics data and
a set of linear decoders to learn shared embedding of topics and
multi-omics features that can accurately reconstruct the
high-dimensional multi-omics data from their low-dimensional latent
topic space ([47]Figure 1A). Through effectively integrating multiple
modalities from multi-omics single-cell sequencing data, moETM seeks to
achieve three tasks: (1) clustering cells into biologically meaningful
clusters to identify the sub-cell type indicative of phenotype of
interests ([48]Figure 1B), (2) imputing one omics using the other omics
([49]Figure 1C), and (3) identifying cell-type signatures, which serve
as biomarkers for a target phenotype ([50]Figure 1D). Through
comprehensive experiments on seven single-cell multi-omics datasets, we
demonstrate moETM’s ability comparatively with six state-of-the-art
computational methods. We further showcase how moETM facilitates the
analysis of the COVID-19 single-cell CITE-seq dataset. Quantitatively,
we observe that moETM learns the joint embeddings from multiple
modalities with better or comparable bio-conservation, batch effect
correction, and crossmodality imputation compared with the existing
methods.[51]^5^,[52]^6^,[53]^7^,[54]^8^,[55]^9^,[56]^11^,[57]^12
Furthermore, the topic embedding learned by moETM can be used to gain
biological insights into the cell-type-specific multi-omics regulatory
elements.
Figure 1.
[58]Figure 1
[59]Open in a new tab
moETM model overview
(A) Modeling single-cell multi-omics data across batches. For details,
see STAR Methods.
(B) Evaluating moETM through cell clustering. The integration
performance of moETM is evaluated by clustering cells based on their
topic proportion and qualitatively evaluated by UMAP visualization.
(C) Cross-modality imputation.
(D) Downstream topic analysis. The learned topics-by-{cells, genes,
proteins, peaks} matrices enable identifying cell-type-specific topics,
gene signatures, surface protein signatures, and regulatory network
motifs, respectively.
Results
moETM model overview
As an overview, moETM integrates multi-omics data across different
experiments or studies with interpretable latent embeddings
([60]Figure 1). It is built upon the widely used VAE[61]^13 to model
multi-modal data ([62]Figure 1A). However, to tailor the VAE framework
for the single-cell multi-omics data, we made two main contributions on
both the encoder and the decoder of the VAE.
The encoder in moETM is a two-layer fully connected neural network,
which infers topic proportion from multi-omics normalized count vectors
for a cell. We assume the latent representation of each omics follows a
K-dimensional independent logistic normal distribution. Our goal is to
effectively combine these distributions into a joint distribution of
the multi-omics data. To this end, we take the product of the
K-dimensional Gaussians, i.e., product-of-Gaussians (PoG). Because the
PoG is also a Gaussian density function, we can represent the joint
latent distribution in closed form. In principle, this results in a
tighter evidence lower bound (ELBO) and therefore more efficient
variational inference compared with the MoE approaches[63]^14 as
adopted in MultiVI/TotalVI[64]^6^,[65]^7 and scMM.[66]^9 In particular,
these MoE approaches sample K-dimensional Gaussian variables for each
omics and then take their average. In contrast, our PoG formalism
requires sampling only once from the joint Gaussian. Therefore, moETM
may confer more robust estimates thanks to the reduced sampling noise
from the Monte Carlo approximation procedure. We perform a softmax
transformation on the joint Gaussian density. The resulting logistic
normal distribution can be considered as a topic mixture membership for
the cell. These topics can be directly mapped to known cell types based
on their top gene signatures detected from our linear decoder as the
topic distribution must sum to 1 over the K topics; the inferred topic
mixture membership of a cell expresses statistical uncertainty in the
cell embedding.
On the decoder side, inherited from our earlier work,[67]^15 moETM
employs a linear matrix factorization to reconstruct the normalized
count vectors from the cell embedding. Our working hypothesis is that
the encoder creates a linearly separable space for the decoder to
achieve a good reconstruction when the two networks are trained
end-to-end. Specifically, the decoder factorizes the cell-by-feature
matrices into a shared cell-by-topic matrix
[MATH: Θ :MATH]
, a shared topic-embedding matrix
[MATH: α :MATH]
, and M separate feature-embedding matrices
[MATH: ρ(m) :MATH]
, where
[MATH: m∈{1,
…,M} :MATH]
indexes the omics. Since different omics share the same cells-by-topics
matrix but have their own feature-embedding matrices, we can explore
the relations among cells, topics, and features in a highly
interpretable way. This departs from the existing VAE models such as
scMM,[68]^9 BABEL,[69]^16 and Multigrate[70]^10 that used a neural
network as the decoder. Another main challenge in single-cell data
analysis is the batch effects, which are sources of technical
variation. To account for those, we introduced the omics-specific batch
removal factors
[MATH: λ(m)∈RV(m)×S
mrow> :MATH]
for each omics m ([71]Figure 1A), which act as linear-additive
batch-specific biases in reconstructing each modality. By regressing
out the batch effects via
[MATH: λ(·) :MATH]
, moETM can learn biologically meaningful representations in terms of
the cell topic mixture and the topic/feature embedding. As detailed in
[72]STAR Methods, all the parameters in moETM are learned end-to-end by
maximizing a common objective function defined as the ELBO of the
marginal data likelihood under the framework of amortized variational
inference.
Multi-omics integration
We performed quantitative evaluations of moETM on the integrated
low-dimensional representation compared with six state-of-the-art
multi-omics integration methods (SMILE,[73]^5 scMM,[74]^9 Cobolt,[75]^8
MultiVI/TotalVI,[76]^6^,[77]^7 MOFA+,[78]^11 and Seurat V4[79]^12) on
seven published datasets. Four out of the seven datasets are
single-cell transcriptome and chromatin accessibility (gene + peak)
datasets and the other three are single-cell transcriptome and surface
protein expression (gene/transcript + protein) datasets measured by
CITE-seq.
The performance of the multi-omics integrative task was based on both
biological conservation metrics and batch removal metrics (STAR
Methods). For the biological conservation score, we adopted the common
metrics including Adjusted Rand Index (ARI)[80]^17 and Normalized
Mutual Information (NMI[81]^18). For evaluating batch effect removal,
we used k-nearest neighbor batch effect test (kBET)[82]^19 with graph
connectivity (GC).
To make a comprehensive comparison, we used three experimental
settings: (1) 60/40 random split for training and testing with 500
repeats ([83]Tables S1 and [84]S2), (2) training and testing both on
the whole dataset ([85]Table S3), and (3) training and testing across
different batches ([86]Tables S4 and [87]S5). The number of topics was
set to 100 during the training based on the robust performance
([88]Figure S1). Overall, we obtained consistent results across all
three settings and therefore chose to focus on describing the results
based on the first setting.
We observed that moETM achieved the best overall performance when
averaging over all or gene + protein datasets’ performance scores among
three out of four evaluation metrics ([89]Figure 2). It conferred the
second-highest averaged kBET, which is only marginally behind
multiVI/totalVI. The latter two methods might have over-corrected the
batch effect at the expense of biological conservation. When averaging
over gene + peak datasets, moETM can still achieve the best performance
among two out of four evaluation metrics. Specifically, moETM ranked
the second highest on ARI and NMI and slightly behind Seurat V4, which
has a larger standard deviation compared with moETM.
Figure 2.
[90]Figure 2
[91]Open in a new tab
Methods comparison based on cell clustering
The left column illustrates the individual performance of each method
on each dataset. The seven datasets are indicated on the x axis with
gene + peak datasets colored in blue and gene + protein datasets
colored in black. Within each dataset, the highest value was labeled on
the top of the corresponding bar. The middle column is the comparison
of averaging values across datasets for each method. The right column
is the comparison between moETM and its three ablated versions. Each
row represents an evaluation metric.
(A) Adjusted Rand Index (ARI).
(B) Normalized Mutual Information (NMI).
(C) k-Nearest neighbor batch effect test (kBET).
(D) Graph connectivity (GC).
For individual datasets, moETM is either the best or the second best
method on six out of seven datasets (except MBC) for different
experimental settings in terms of the ARI ([92]Figure 2A; [93]Tables S1
and [94]S2). One possible reason could be that the sample size of MBC
(3,293 cells) from which moETM learns high-dimensional peak embeddings
is small compared with the other 6 datasets. To assess the benefits of
the added features in moETM, we compared moETM with its ablated
versions (moETM_rna, moETM_atac, and moETM_protein), where moETM was
trained on a single omic. As expected, the performance of moETM on
single modality decreased, indicating that moETM could improve its
performance by leveraging multiple modalities ([95]Figure 2, right
panel).
Similar quantitative conclusions can be drawn based on NMI
([96]Figure 2B; [97]Tables S1 and [98]S2). For kBET ([99]Figure 2C),
moETM is the best for the BMMC1 dataset and the second best on the
other datasets—slightly behind MultiVI/TotalVI. Therefore, while moETM
conferred higher biological conservation scores in terms of ARI and
NMI, it still maintains comparable kBET scores on all four datasets
compared with MultiVI/TotalVI. Indeed, we observed an excellent balance
between the biological conservation and batch effect removal because
moETM achieved notably higher GC compared with all other methods
([100]Figure 2D). This is because GC is the only metric that is based
on both cell types and batch labels by measuring the similarity among
cells of the same type from different batches based on the embedding
learned by each method.[101]^20
We postulated that the main reason for moETM’s superior integration
performance is its PoG formulation. To that end, we constructed
moETM_avg, which replaced PoG with averaging of sampled variables from
individual Gaussian distributions similar to the existing VAE models
such as scMM.[102]^9 As expected, the performance of moETM_avg was
worse than moETM in all datasets in terms of all evaluation metrics
([103]Figure 2, right panel). Furthermore, since scMM also adopted the
average of the Gaussian samples from the encoder, the fact that
moETM_avg outperformed scMM indicates the benefits of using the linear
decoder, which further improves the multi-omics integration while
correcting batch effects across all cells.
We further verified the clustering performance by visualizing the cell
topic mixture embeddings using Uniform Manifold Approximation and
Projection (UMAP)[104]^21 ([105]Figure 3). Indeed, not only did moETM
remove batch effects but also revealed a better representation of cell
type clusters. For example, “Plasmablast IGKC–” cells were grouped
closely by moETM but were clustered into multiple small parts by SMILE
([106]Figure 3A). Moreover, plasmablast cells from different batches
were also mixed better by moETM compared with SMILE, which indicated a
better batch effects correction. “CD4+ T activated” and “CD4+ T naïve”
cells were closer within the same cluster but clearly distinguishable
between themselves. In contrast, these two cell types were mixed
together by SMILE and scMM. In modeling the BMMC1 dataset (gene +
peak), “B1 B” cells and “naive CD20^+ B” cells ([107]Figure 3B) were
mixed by other methods while better separated by moETM.
Figure 3.
[108]Figure 3
[109]Open in a new tab
UMAP visualization of cell clustering
(A) UMAP visualization of moETM, SMILE, and scMM on single-cell
CITE-seq from BMMC2 dataset colored by batches (upper panel) and cell
types (lower panel).
(B) UMAP visualization of moETM, SMILE, and scMM on the gene + peak
multi-omics data from the BMMC1 dataset.
In addition, we visually compared the cell clustering by the individual
modalities with that by the integrated modalities via UMAP
([110]Figure S2). Similarly, we also compared the cell clustering based
on the individual omics of single-cell transcriptome and surface
proteins with the clustering based on the integrated RNA + protein
topic mixture using the BMMC2 dataset. We observed that the integrated
topic representation led to a more coherent cell clustering compared
with the topic mixture of individual omics ([111]Figure S2). For
example, the “CD14^+ Mono” cells were grouped more closely by the
integrated topic mixture compared with their unimodal counterparts
([112]Figure S2A; ARI: 0.735 by RNA + peak in contrast to 0.690 by RNA
only and 0.648 by peak only). Similarly, the plasmablast IGKC– cells
also formed tighter clusters in the integrated RNA + protein embedding
space ([113]Figure S2B; ARI: 0.734 by RNA + protein in contrast to
0.688 by transcriptome only and 0.590 by surface proteins only).
Therefore, moETM was able to improve cell clustering by integrating
multiple modalities. Taken together, these results show that moETM is
able to distinguish similar cell types by capturing biological
information in its encoding space while removing batch effects.
Cross-omics imputation
In the case of gene + protein, moETM accurately imputes surface protein
expression from gene expression for the BMMC2 dataset, achieving
average Pearson (Spearman) correlation of 0.95, 0.92, and 0.88 (0.94,
0.90, and 0.85) on random split, leave-one-batch, and
leave-one-cell-type imputation experiments, respectively
([114]Table S6). We visualized the reconstructed protein expression
against the observed values using the BMMC2 (gene + protein) dataset
([115]Figure 4A). The imputed protein expression is highly linearly
correlated with the observed one ([116]Figure 4B), which is what we
expected given the high Pearson correlation of 0.95. The runner up
methods—namely, scMM and BABEL—also performed well on this task, both
achieving a correlation score of 0.94.
Figure 4.
[117]Figure 4
[118]Open in a new tab
Cross-omics imputation
(A) Heatmap of original protein and imputed protein values from gene
expression using the BMMC2 CITE-seq dataset. The column and row orders
are the same for the two heatmaps.
(B) Scatterplot of original and imputed surface protein expression. The
diagonal line is in blue color.
(C and D) Heatmap and scatterplot of the original and imputed gene
expression from chromatin accessibility on the BMMC1 dataset. For more
results, see [119]Figure S3.
Compared with the surface protein imputation task, imputing gene
expression from the open chromatin regions is a more challenging task
because of the sparser input scATAC-seq signals and the dynamic and
often asynchronous interplay between the chromatin states and the
transcriptome.[120]^22^,[121]^23^,[122]^24 Nonetheless, moETM achieved
relatively high Pearson (and Spearman) correlation scores of 0.69,
0.65, and 0.58 (and 0.37, 0.35, and 0.32) on random split,
leave-one-batch, and leave-one-cell-type experiments. These are notably
higher than the corresponding correlation obtained by BABEL (Pearson:
0.65, 0.60, 0.55; Spearman: 0.34, 0.33, 0.30) and scMM (Pearson: 0.63,
0.61, 0.54; Spearman: 0.33, 0.33, 0.28) ([123]Table S6). Qualitatively,
the imputed and the observed gene expression profiles also exhibit
similar pattern and linear relationship ([124]Figures 4C and 4D).
In the previous two imputation applications, low-dimensional modalities
were generated from high-dimensional modalities. The imputation from
the low dimension to the high dimension is more difficult but
nonetheless feasible. Specifically, on the three same experimental
designs, the Pearson (and Spearman) correlations between the observed
and the imputed open chromatin regions from gene expression are 0.58,
0.55, and 0.51 (and 0.33, 0.30, and 0.28); the Pearson (and Spearman)
correlation between the observed and imputed gene expression from
protein expression are 0.65, 0.63, and 0.60 (and 0.41, 0.39, and 0.37)
([125]Table S6). In contrast, the runner-up method scMM achieved
Pearson (and Spearman) correlations of 0.40, 0.29, and 0.37 (and 0.29,
0.25, and 0.21) for imputing chromatin accessibility from gene
expression. For imputing gene expression from surface protein, scMM and
BABEL also fell behind moETM in terms of both Pearson and Spearman
correlations ([126]Table S6). Qualitatively, the imputed and the
observed peaks and gene expression exhibit consistent patterns
([127]Figures S3A and S3C) and strong linear trends ([128]Figure S3B
and S3D).
Correlating RNA transcripts with surface proteins and in cis chromatin
accessibility regions
As a proof-of-concept, we sought to assess whether the top surface
proteins can be mapped to the top genes under the same topic (i.e.,
following the central dogma). To this end, we trained a 100-topic moETM
on the BMMC2 (gene + protein) dataset generated by CITE-seq over 90,000
cells. For each topic, we calculated the Spearman correlation of topic
scores between the 134 pairs of the gene transcripts and the
corresponding translated surface proteins ([129]Figure 5A). The
correlations ranged from −0.096 to 0.751 with an average of 0.29. In
particular, 96 of the 100 topics have positive correlations. Among
them, 13 topics have correlations larger than 0.5.
Figure 5.
[130]Figure 5
[131]Open in a new tab
Topic analysis of gene + protein CITE-seq data
(A) Protein-RNA correlations and pathway enrichments for the 100 topics
learned from the CITE-seq BMMC2 data. In the middle and the bottom
panels, dots correspond to the tested immunologic signature gene sets
from MSigDB. Different colors represent different gene sets.
(B) Topics embedding of 10,000 sub-sampled cells from the BMMC2
dataset. Only the topics with the sum of absolute values greater than
the third quartile across all sampled cells were shown. The two color
bars display two tiers of annotations for the 9 broad cell types
(cellType1) and 45 fine-grained cell types (cellType2).
(C) GSEA leading-edge analysis of topic 40. The left and right panel
represent significantly enriched gene sets (q value
[MATH: < :MATH]
0.001) based on gene topic scores and protein topic scores,
respectively.
(D) Genes and proteins signatures of the select topics. The upper and
lower panels display the topics-by-genes and topics-by-proteins
heatmap. The top genes and proteins that are known cell-type markers
based on CellMarker or literature search are highlighted in blue.
(E) UMAP visualization of the genes, proteins, and topics via their
shared embedding space. The corresponding topic indices and
gene/protein symbols were highlighted by corresponding colors.
To further quantify the known transcript-protein as well as gene-peak
regulatory relations, we computed their Spearman correlations across
topics inferred from the BMMC datasets. We paired a peak with a gene if
it was within 150k bp distance from the transcription start site of the
gene. The overall distribution of the correlations for
transcript-protein pairs and gene-peak pairs were both significantly
higher than 0 (p
[MATH: < :MATH]
2.2e−16; one-sample t test) and comparable with the correlations
calculated directly from the observed data ([132]Figures S4A and S4C).
Notably, 90% of transcript-protein pairs exhibited positive
correlations and nine pairs displayed correlations exceeding 0.5
([133]Figure S4E). Nonetheless, several transcript-protein pairs
exhibited low or negative correlations. Several factors could
contribute to these low correlations. Firstly, random noise may hinder
correlations between genes and proteins. Secondly, dynamic cellular
processes at the single-cell level can give rise to variations between
cells, leading to a decrease in correlations.[134]^25 For example,
transcriptional bursting or delays between transcription and
translation will lead to asynchronous behavior of gene and protein
during the cell cycle, thereby reducing the correlations between gene
and protein expression levels.[135]^25 Particularly, a number of
transcript-protein pairs displayed negative correlations
([136]Figure S4E). This phenomenon has also been observed in previous
studies.[137]^26^,[138]^27^,[139]^28 For instance, Li et al. reported a
mismatch between mRNA and protein expression levels, including the
CD69-CD69 pair.[140]^27 One possible cause might be due to the impact
of other biological processes overriding the effects of
transcription.[141]^28 Taking CD69-CD69 as an example, the CD69 gene
may undergo post-translational modifications such as differential
glycosylation.[142]^29 The transcribed CD69 mRNA molecules can be
translated to a 22.5 kDa polypeptide, which can further be
differentially glycosylated to two different subunits. These subunits
can be randomly combined to form different receptors, leading to a
reduction in the abundance of the CD69 protein.[143]^29 If the
influence of post-translational modifications surpasses the impact of
protein synthesis, it can give rise to a negative correlation.
Furthermore, the CD69 mRNA transcripts are unstable. The level of CD69
mRNA could decline rapidly with a half-life of less than
60 min.[144]^29 While mRNA molecules degrade over time, protein levels
may maintain relatively stable or continue to accumulate. If the rate
of mRNA degradation surpasses that of protein synthesis, a negative
correlation could emerge.
In addition to investigating correlations across topics, we calculated
correlations across cells by computing Spearman correlations in terms
of observed values and reconstructed values based on the BMMC datasets
([145]Figures S4A–S4D). The correlations based on both the observed and
reconstructed data across cells were significantly greater than 0,
indicating consistent relations among transcript-protein and gene-peak
pairs captured at the cell level. However, the correlations from the
reconstructed values are higher than those from the observed values
([146]Figures S4B and S4D). This is because the observed values may
contain random noise or batch effects compared with reconstructed
values by moETM, which can be considered as denoised and
confounder-corrected values of the gene/protein/peak signals.
Immune cell-type signatures revealed by multi-omics topics learned from
CITE-seq data
To identify cell-type signatures, we associated each topic with the
specific cell type that exhibit the highest average topic score across
cells. Notably, not all topics were uniquely associated with one single
cell type and some topics might be enriched for a combination of
multiple cell types. Therefore, we chose to describe a selected subset
of the topics based on their distinctly enriched cell types and heatmap
visualization ([147]Figure 5B).
For instance, topic 44 was associated with monocytes and consists of
CD14^+ and CD16^+ Mono; topic 40 was associated with B cells and
consists of primarily naive CD20^+ B IGKC^+ and naive CD20^+ B IGKC^–
cells; topic 83 was associated with natural killer cells. These are
visually easy to detect from the topic mixture probabilities among the
individual cells ([148]Figure 5B).
Under each cell-type-enriched topic, many top genes and top proteins
are the known cell-type markers ([149]Figure 5D). For example, under
topic 40 (i.e., a B cell topic), the top genes CR2, SSPN, and ADAM28
are known marker genes for B cells; the top proteins CD21, CD20, and
CD40 are also marker proteins for B cells according to the CellMarker
database.[150]^30 For topic 7, one of top proteins CD11c is a marker
protein for dendritic cells (DCs).[151]^30 For topic 83, protein CD16,
marker for natural killer cells, is among its top proteins.[152]^30 For
topic 44, the top gene S100A9’s coding protein is a chemotactic factor
for monocytes[153]^31 and is highly expressed during inflammatory
processes[154]^32; among the top proteins for topic 44, CD36,[155]^33
CD33, and CD11c[156]^34 are also markers for monocyte sub-cell types.
Similarly, the monocyte is also enriched in topic 23, which shares the
top marker protein CD16 with topic 44 but also contains unique top
genes such as CDKN1C and FCGR3A. While CDKN1C is a known marker gene
for monocyte,[157]^35 FCGR3A is upregulated in CD16^+ monocytes as
supported by the existing literature.[158]^36
Moreover, we performed gene set enrichment analysis
(GSEA)[159]^37^,[160]^38 using the topic scores for all of the genes
and proteins. Because BMMCs are immune cells, we queried the
C7 ImmuneSigDB from MSigDb, which is a collection of 5,219 gene sets
related to immune pathways.[161]^39^,[162]^40^,[163]^41 Across all
100 topics, we identified 2,569 enriched gene sets with q values
[MATH: < :MATH]
0.05 using gene topic scores and 22 enriched gene sets using protein
topic scores ([164]Figure 5A). For example, in topic 40, using the gene
topic scores, we found a gene set that consists of upregulated genes in
B cells with respect to monocytes[165]^42 ([166]Figure 5C, left panel);
using the protein topic scores, we found a gene set that consists of
upregulated genes in B cells compared with plasmacytoid DCs
([167]Figure 5C, right panel).[168]^42
Furthermore, we projected the topic embeddings and feature embeddings
onto a common 2D space using UMAP ([169]Figure 5E). We observed that
the top marker genes and the top marker proteins for the cell type
clustered together around the corresponding topics, implying a
well-aligned embedding space within and across these modalities.
Together, the results suggested that the cell-type-enriched topics
inferred by moETM from the CITE-seq data reveal meaningful biological
relations between genes and proteins.
Joint multi-omics topic analysis identified cell-type-specific pathways and
regulatory motifs
The topic embedding learned from the scRNA + scATAC data enables us to
investigate the relationship between top genes and top peaks in the
cell-type-specific topics. Given that many top genes are known
cell-type markers ([170]Figure 6A), we postulated that the top peaks
could be associated with the top genes via in cis or in trans
regulatory elements. One challenge in interpreting the gene + peak
multi-omics topics is that peaks cannot be matched directly with genes.
We proposed two approaches to solve this issue. One is to link peaks to
their nearby genes to obtain the peak-neighboring genes (STAR Methods).
The other approach is to identify enriched motifs among the top peaks
and explore the relationship between genes and motifs via the
corresponding transcription factors (TFs) and their target genes.
Figure 6.
[171]Figure 6
[172]Open in a new tab
Topic analysis of single-cell gene + peak data from the BMMC1 dataset
(A) Top genes and top peak-neighbor genes of the select topics.
(B) Topic embedding of cells from the BMMC1 dataset.
(C) GSEA leading edge analysis of topic 3. The left panel is the GSEA
result using gene topic scores and the right panel is the GSEA result
using peak-neighboring gene topic scores.
(D) Pathway enrichment and motif enrichment for the 100 topics.
(E) Topic-directed regulatory networks based on motif enrichment
analysis. The yellow edges indicate known TF-target associations based
on ENCODE TF Targets dataset.
(F) Topic-directed regulatory networks incorporating enriched pathways,
motifs, and top genes. For more results, see [173]Figure S5.
For the first approach, the top genes and top peak-neighboring genes in
the select topics served as markers for the cell-type-specific gene
regulatory programs ([174]Figure 6A). For example, topic 32 is
associated with CD8^+ T cells ([175]Figures 6A and 6B). We zoomed-in
the topic by examining its top genes and top peaks. Three of the top 5
genes (TNFRSF9, ASTL, GZMK, DUSP2, and DGKH) were related to T cells.
In particular, GZMK is a marker gene for T cells based on
CellMarker[176]^30; TNFRSF9 codes for a signaling protein that promotes
expression of cytokines in CD8^+ T cells[177]^43; DUSP2 encodes an
inducible nuclear protein and is highly expressed in T cells.[178]^44
Among the top 5 peak-neighboring genes (APBA2, PRDX2, KLRC4, OBSCN, and
XCL2), APBA2 is a marker gene for cytotoxic CD8^+ T
lymphocytes[179]^45; XCL2 expression levels substantially increased in
CD8^+ T cells during T cell activation.[180]^46 As another example,
topic 3 is associated with CD4^+ T naive cells. Three out of the top 5
genes (CCR4, ADAM12, PTPN13, MB21D2, and IL4I1) and two out of the top
5 peak-neighboring genes (INPP4B, CCR4, PRDX2, RORA, and HIST1H2BD) are
related to T cells. Indeed, CCR4 is shown to be specifically expressed
among naive CD4^+ T cells[181]^47; ADAM12 is expressed in T cells in
the inflamed brain and is a potential target for the treatment of
Th1-mediated diseases[182]^48; IL4I1 increases the threshold of T cell
activation and partially modulates CD4 T cell differentiation.[183]^49
For top peak-neighboring genes, RORA is upregulated among the activated
CD4^+ T cells.[184]^50
To gain further mechanistic understanding of the inferred topics, we
performed GSEA on the topic scores for the genes from the transcriptome
modality and the topic scores for the peak-neighboring genes from the
chromatin accessibility modality ([185]Figure 6D). Many enriched gene
sets are related to the topic-associated cell types. For topic 3, for
instance, one of the enriched gene sets based on the gene topic scores,
is upregulated in healthy CD4 T cells compared with healthy myeloid
cells[186]^51 ([187]Figure 6C). This is consistent with an enriched
gene set from the peak-neighboring gene analysis of topic 3, where the
gene set consists of a set of genes that were upregulated in naive CD4
T cells relative to the DC.[188]^52 Therefore, GSEA further confirmed
the cell-type-specific functions of the top genes and peak-neighboring
genes identified via moETM’s topics. Interestingly, the top transcripts
and the top peak-neighboring genes do not often correspond to the same
genes. This implies that the peaks and genes provide complementary
information to (sometimes the same) cell-type-specific regulatory
programs. Therefore, by effectively integrating the scRNA-seq and
scATAC-seq data, the inferred multi-omics topics can reveal functional
convergence at the pathway level.
Besides using peak-neighboring genes, as the second approach we also
performed motif enrichment analysis on the top 100 peaks per topic
(STAR Methods; [189]Figure 6D). We then constructed a putative
regulatory network by linking the top genes and the enriched motifs via
their associated topics ([190]Figures 6E and 6F). Interestingly, some
of the top genes harbor those enriched motifs, implying that these
genes are the putative target genes of the cognate TF. In topic 3, for
example, one of the enriched motifs corresponds to a TF named FLI1 (p =
0.00117), and the top genes IL4l1 and PTPN13 are target genes of FLI1
based on the ENCODE Transcription Factor Targets.[191]^53^,[192]^54 As
another example, one of the enriched motifs for topic 32 corresponds to
TF MEF2A (p = 5.21e-5), whose target genes include the top genes RGS1,
EGR1, GZMK, ASTL, and DUSP2.[193]^53^,[194]^54
We further expanded our topic-network analysis by including enriched
pathways and cell type information ([195]Figures 6E and [196]S5). We
defined the intra-connections within the same topic as edges between
the topic nodes and cell type nodes. We also established
inter-connections between genes and external nodes including motifs and
pathways. Specifically, the top genes under each topic could serve as
members of enriched pathways or target genes for enriched motifs. For
instance, in topic 32, gene DUSP2 is a target gene for enriched motifs
MEF2A (p = 5.21e-5, permutation test) and PAX5 (p = 1.53e-5,
permutation test), while also being a member gene in four enriched
pathways and three of them are upregulated gene sets in the enriched
cell type of T cells (UNSTIM_VS._ACD3_ACD28_STIM_WT_CD4_TCELL_DN,
NKT_CELL_VS._ALPHAALPHA_CD8_TCELL_DN,
UNSTIM_VS._ACD3_ACD28_STIM_WT_CD4_TCELL_DN). Similarly, in topic 30,
gene FCAR is a target gene for one enriched motif CEBPB (p = 1.27e−5,
permutation test) and a member of three enriched pathways, two of which
are related to gene sets upregulated in the enriched cell type of
monocyte (MONOCYTE_VS._MDC_UP,
PBMC_MRKAD5_HIV_1_GAG_POL_NEF_AGE_20_50YO_1DY_UP,
MONOCYTE_VS._MDC_DAY7_FLU_VACCINE_UP). Likewise, in topic 3, gene DPP4
is a target gene for two enriched motifs PAX5 (p = 5.39e-4, permutation
test) and SP1 (p = 1.13e-4, permutation test) and a member of 11
enriched pathways where four of them (NAIVE_TCELL_VS._NKCELL_UP,
NAIVE_TCELL_VS._MONOCYTE_UP, NAIVE_CD4_TCELL_VS._MONOCYTE_UP,
NAIVE_CD4_TCELL_VS._DC_UP) are upregulated gene sets in CD4^+ T naive
cells. Those connections highlighted a consistent regulatory
relationship across motifs and pathways under inferred topics.
Therefore, our multi-omics topic analysis suggests that some of the
cell-type-specific regulatory programs are implicated with the sequence
motifs and pathways. Further investigation is needed to establish the
hierarchical relation between the TF and the cell lineage.
Prior pathway-informed enrichment
The single-cell multi-omics data are high-dimensional, sparse, and
noisy. This is especially the case for the scRNA + scATAC-seq data
because of the large number of genes and open chromatin regions. One
way to further improve the interpretability of the topics derived from
these data is by incorporating prior knowledge such as gene sets or
pathway information. In the context of our moETM, this was done by
fixing the embeddings-by-genes parameters to the observed
pathways-by-genes matrix (STAR Methods). Using the
[MATH: ∼ :MATH]
7,000 gene ontology biological process terms as the pathways-by-genes
matrix, we trained the pathway-informed moETM (p-moETM) on the BMMC1
gene + peak dataset.
Quantitatively, p-moETM can achieve comparable cell-clustering
performance with an ARI of 0.72, which is only slightly lower than the
default moETM that learned the gene embedding directly from the data
([197]Table S1). We also identified several cell-type-specific topics
along with their top genes and peaks ([198]Figure S6C–S6H). Notably,
the learned topics-by-embeddings matrix
[MATH: α :MATH]
from p-moETM is essentially the topics-by-pathways matrix. This allows
us to directly identify the top pathways for each topic without
performing post-hoc GSEA. For instance, topic 25 is associated with B1
B cells ([199]Figure S6C). One of its top pathways is related to B cell
activation ([200]Figure S6A). As another example, topic 8 was enriched
for the CD4^+ T activated cells, and one of its top pathways was
connected to the T cell apoptotic process.
For some topics, their top genes are both the members of the pathway
and the cell-type biomarkers ([201]Figure S6, left panel). For
instance, topic 27 is enriched in the CD4^+ T naive cells. One of its
top genes CCR7 is involved in the elimination process of immature
T cells. In addition, topic 41 is enriched for the transitional B cell.
Its top pathways include B cell activation and adaptive immune process.
Among its top genes, TNFAIP3 is in the B cell activation-related
pathway. One of its top peaks in chr14: 100207793–100208735 is upstream
of the promoter of YY1 (chr14: 100238298–100282788), which is a gene
member in the B cell activation-related pathway.[202]^55
Furthermore, we experimented using a more specific gene set namely the
immune signature gene set collections from MSigDB to investigate
immune-related pathways implicated in the BMMC1 dataset
([203]Figure S6, right panel). We identified several cell-type-specific
topics that exhibit high scores for meaningful immune pathways. For
instance, topic 23 is enriched in naive CD20^+ B cells. Two of its top
10 pathways are associated with naive B cells. One of its top genes
namely HLA-DPB1 is upregulated in naive B cells relative to the plasma
cells.[204]^56 One of the top peaks (chr12: 8886393–8887019) is
upstream of PHC1 (chr12:8913896–8941467), which is also involved in the
pathway where genes are upregulated in naive B cells relative to the
plasma cells.[205]^56
Multi-omics topics reveal the molecular basis of COVID-19 severity
As the CITE-seq technology interrogates the expression of surface
proteins along with the full transcriptome, it is a promising platform
to investigate the immune responses among patients infected by the
SARS-CoV-2 virus (COVID-19). Using moETM, we sought to identify
clinically relevant molecular signatures from a COVID-19 CITE-seq
dataset (HBIC).[206]^57 The data consist of 781,123 cells from 130
COVID-19 patients with varying degrees of severity due to the viral
infection. To establish model confidence, we first performed a
quantitative analysis as above. The results showed that moETM could
achieve either the highest or the second-highest evaluation metrics
both in bio-conservation and batch removal cases ([207]Table S2). In
particular, moETM ranked first with an ARI value of 0.752. Similarly,
moETM and TotalVI attained the highest NMI scores of 0.779 and 0.762,
respectively. Both methods also maintained their top performance in
terms of batch correction with TotalVI achieving the highest kBET of
0.197, while moETM came in second with 0.153. Consistent to the above
evaluation ([208]Table S2), moETM obtained the best GC score of 0.950,
whereas TotalVI achieved the second best of 0.934. Therefore, these
quantitative results on the COVID-19 data further suggest that moETM
strikes a good balance between biological conservation and batch effect
correction.
Qualitatively, we investigated the top features and identified enriched
cell types under each topic ([209]Figures S7A and S7B). In particular,
topic 42 is enriched for B cells. Among its top 5 genes (SLC38A11,
TCL1B, IL6, TCL1A, SYN3), IL6 and TCL1A are the known marker genes.
Also, three out of its top 5 proteins (CD19, CR1, CD22, FCGR2A, and
BAFFR) are marker proteins for B cells. Topic 31 is associated with
platelets. Two out of its top 5 genes (LYVE1, RADIL, VWF,[210]^58
TRHDE, and PPBP) are marker genes, namely VWF and PPBP, and one of its
top 5 proteins (ITGA2B, KIR3DL1, ITGAX, SELP, and FCGR2A) is a marker
protein (i.e., ITGA2B) for platelets. In addition, a previous study has
suggested that SELP redistributes to the plasma membrane during
platelet activation.[211]^59 The enriched pathways based on GSEA are
consistent to the cell-type specificity of those topics
([212]Figure S7C). Taking topic 42 as an example, the enriched pathway
is the gene set that is downregulated in CD4 T cells compared with B
cells.[213]^51 Because of the shared embedding space, we also observed
localization of the top genes and the top proteins for the selected
topics via UMAP ([214]Figure S7D).
We then leveraged the phenotype severity information among the patients
to explore gene and protein signatures related to the COVID-19
phenotypes. Specifically, we utilized COVID metadata information to
test whether a topic is significantly over-represented for the severity
conditions. Here, we considered each topic as a “meta-gene” and
associated their upregulation or downregulation with the disease
phenotypes ([215]Figures 7A and 7B). We delved into topics based on
their distinctly enriched cell types, heatmap visualization
([216]Figure S7A), and differential analysis by the phenotypes (e.g.,
COVID severity) (STAR Methods). For example, we observed that topic 42
is not only enriched for B cell but also upregulated among patients
with critical COVID status, whereas topic 80 is significantly
associated with the severe status. Moreover, topic 42 is associated
with other demographic features such as age and mainly enriched in the
senior group between 70 and 79 years ([217]Figure 7A).
Figure 7.
[218]Figure 7
[219]Open in a new tab
Topic association with the COVID-19 severity status
(A) Differential analysis of severity states, sex, smoking history, and
age. The color intensity values correspond to the differences of
average topic scores between the positive cells and negative cells for
each attribute and each topic. Asterisks indicate Bonferroni-adjusted p
value
[MATH: < :MATH]
0.001 based on one-sided t test of upregulated topics for each label.
(B) Differential analysis of topics across cell types. The heatmap on
the left displays the topic associations with each of the 18 cell
types, and the one on the right associates the same topics with 6
fine-grained B cell subtypes.
(C) UMAP visualization of cell clustering. The right panel shows a
zoom-in version of the B cell clustering with color indicating the six
B cell subtypes.
(D) UMAP visualization with cells colored by source subjects’ severity
states due to COVID-19 infection.
(E) Normalized gene expression of IL6 among the cells on the same UMAP.
Given its disease relevance, we further investigated topic 42 to see
whether it elucidates more granular cell types and to some extent
whether their top gene/protein signatures can serve as putative
biomarkers for COVID critical conditions. First of all, the
moETM-inferred cell topic embeddings did not only cluster cells into
their primary cell types but also sub-divided B cells into six
sub-clusters of known sub-cell types ([220]Figure 7C and zoom-in view).
Intriguingly, aligning the COVID phenotypes with the B cell sub-types
revealed that the critical COVID condition corresponded to B malignant
cells ([221]Figures 7C and 7D). B cell lymphomas start to develop when
B lymphocytes, which are in charge of humoral immunity, start to
proliferate beyond control. This proliferation turns B cells into
malignant cells.[222]^60 The previous study[223]^61 suggested that
individuals with certain cancers, such as lymphoma, may be more
susceptible to getting severe illness from COVID-19. The top gene IL6
in topic 42 was consistently expressed at a high level among B cells,
including although not specifically in B malignant cells
([224]Figure 7E). IL6 levels were commonly reported in severely ill
patients due to COVID-19.[225]^62^,[226]^63 As another example, topic
21 is also enriched in B malignant cells ([227]Figure 7B). One of its
top proteins CD5 ([228]Figure S7B) was shown to be highly expressed on
malignant cells.[229]^64 Moreover, the previous study[230]^65 suggested
that the proportion of CD5^+ B cells was significantly reduced in COVID
patients. Taken together, our results suggest that IL6 or CD5 may be a
potential therapeutic target.
Discussion
Gene regulatory programs involve multi-faceted regulation and cannot be
understood via a single-omics approach alone. As these technologies
continue to evolve, computational methods are needed to account for the
challenges in modeling the sparse, noisy, and heterogeneous nature of
data that are being generated at a rapid pace.[231]^3 In this study, we
developed a unified interpretable deep learning model called moETM to
integrate single-cell multi-omics data including transcriptome and
chromatin accessibility or surface proteins, which are the most common
types of single-cell multi-omics data to date.[232]^4
Our technical contributions are 3-fold. First, via the
product-of-experts, moETM effectively integrates multiple omics by
projecting them onto a common topic mixture representation. Second, the
linear decoder enables the extraction of multi-omics signatures as the
top features under each latent topic, which directly reveal marker
genes and phenotype markers under topics that are aligned with cell
types or phenotype conditions. Third, by efficiently correcting batch
effects via a dedicated linear intercept matrix in the decoder, we can
integrate multi-omics data from multiple studies, subjects, or
technologies, which allows us to exploit the vast amount of multi-omics
data to obtain biologically diverse and coherent multi-omics topics.
To demonstrate the utility of moETM, we benchmarked it with six
existing state-of-the-art computational methods on seven published
datasets including four gene + peak datasets and three gene + protein
datasets ([233]Tables S1 and [234]S2). Across all datasets, moETM
achieved competitive performance based on four common evaluation
metrics. We also confirmed the advantage of using multiple modalities
compared with single modality in terms of cell clustering. Moreover,
because of its joint modeling capabilities, the trained moETM can
accomplish this cross-omics imputation task. In both imputation
directions, moETM achieved a higher correlation than scMM and
BABEL. Although more challenging, moETM also achieved a reasonable
performance when imputing high dimensions from low dimensions.
We also explored the moETM-learned cell-type-specific topics in terms
of their top omics features and enriched pathways in the light of the
supporting evidence from the literature. moETM is able to detect immune
cell-type signatures and identify cell-type-specific pathways and
regulatory motifs. In a more focused study, we analyzed the COVID-19
CITE-seq dataset (gene + protein) and linked moETM-learned
immune-specific topics with patient severity conditions due to the
infection. Our topic analysis revealed not only immune marker genes but
also cell types that are associated with COVID phenotype conditions.
Limitations of the study
There are several challenges that are not addressed in moETM.[235]^4
For instance, moETM has the capacity to integrate across multiple
batches and modalities, but it requires the training data to have all
omics measured in the same cells. A more challenging task is to
integrate multimodal data without anchored features or cells, which is
commonly known as the diagonal integration.[236]^4 Some recent
approaches made use of graph representation learning to integrate
multi-omics single-cell data at the expense of computational complexity
and interpretability.[237]^66^,[238]^67^,[239]^68
STAR★Methods
Key resources table
REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data
__________________________________________________________________
BMMC CITE-seq Luecken et al.[240]^69 [241]GSE194122
BMMC Multiome data Luecken et al.[242]^69 [243]GSE194122
Mouse skin cell SHARE-seq Ma et al.[244]^24 [245]GSE140203
Mouse brain cell SHARE-seq Ma et al.[246]^24 [247]GSE140203
Mouse kidney cell sci-CAR Cao et al.[248]^70 [249]GSE117089
PBMC CITE-seq Hao et al.[250]^12 [251]GSE164378
COVID-19 PBMC CITE-seq Stephenson et al.[252]^57
[253]https://www.covid19cellatlas.org/
__________________________________________________________________
Software and algorithms
__________________________________________________________________
moETM This paper
[254]https://github.com/manqizhou/moETM[255]https://doi.org/10.5281/zen
odo.8104798
scanpy Wolf et al.[256]^71 [257]https://github.com/scverse/scanpy
anndata Virshup et al.[258]^72 [259]https://github.com/scverse/anndata
biomaRt Durinck et al.[260]^73 [261]https://rdrr.io/bioc/biomaRt/man/
seurat Hao et al.[262]^12 [263]https://satijalab.org/seurat/
totalVI Gayoso et al.[264]^6
[265]https://github.com/YosefLab/scvi-tools
SMILE Xu et al.[266]^5 [267]https://github.com/rpmccordlab/SMILE
scMM Minoura et al.[268]^9 [269]https://github.com/kodaim1115/scMM
Cobolt Gong et al.[270]^8 [271]https://github.com/epurdom/cobolt
MultiVI Ashuach et al.[272]^7 [273]https://zenodo.org/record/5762077
MOFA+ Argelaguet et al.[274]^11 [275]https://github.com/bioFAM/MOFA2
[276]Open in a new tab
Resource availability
Lead contact
Further information and requests for resources should be directed to
and will be fulfilled by the lead contact, Yue Li
([277]yueli@cs.mcgill.ca).
Materials availability
This study did not generate new unique reagents.
Method details
moETM data generative process
The molecular activities in each cell n can be measured with M omics,
such as gene expression from transcriptome, surface protein expression,
and the open chromatin regions manifested as peaks. For the ease of the
following descriptions, we define the entities of genes, proteins and
peaks as “features”. Profiling those omics in the cell leads to M count
vectors
[MATH: {xn(m)
}m=1M
:MATH]
, each of which has a dimension
[MATH:
V(m) :MATH]
as the number of unique features in omic m. Adapting the text-mining
analogy, we consider each cell as a “document” written in M languages
or modalities (i.e., transcriptome, proteome, chromatin accessibility);
each feature from the
[MATH:
mth
:MATH]
omic is considered as a “word” from the
[MATH:
mth
:MATH]
vocabulary; each sequencing read is a “token” in the document; the
abundance of the reads mapped to the same feature is the “word count”
in the document.
The multi-modal document of a cell n can be summarized into a mixture
of K latent topics
[MATH: θn :MATH]
, which are presumably implicated in each modality ([278]Figure 1A).
Inference of these topic mixtures for each cell is accomplished by
modeling the distribution of the multi-omic count data
[MATH: {xn(m)
}m=1M
:MATH]
from the topic mixture for the cell and learning the global topic
embedding over the M modalities. The latter are shared among all cells
and expressed as M matrices
[MATH: {Φ(m)∈RK×V(m)}m=1M
:MATH]
, where a row vector
[MATH: φk(m)
∈RV(m
) :MATH]
denotes the k-th topic from the m-th modality.
To increase information sharing across the omics and the model
expressiveness, we further decompose each omic-specific topic embedding
matrix
[MATH: Φ(m) :MATH]
into the topic embedding
[MATH: α∈RK×L<
/mrow> :MATH]
and feature embedding
[MATH: ρ(m)∈RL×V
(m)
mrow> :MATH]
, where L denotes the size of the embedding space. The expected values
for the count data for each omic is proportional to the dot product of
the cell embedding, topic embedding matrix, and feature embedding
matrix:
[MATH: xn(m)
∝θnαρ(m) :MATH]
.
Formally, we formulate the data generative process as follows. For each
cell indexed by
[MATH: n∈{1,
…,N} :MATH]
, draw a
[MATH: 1×K :MATH]
topic proportion
[MATH: θn :MATH]
from logistic normal distribution
[MATH: θn∼LN(0,<
mi mathvariant="bold">I) :MATH]
:
[MATH: δn∼N(0,I),
θn=softmax(δn)=exp(δn,k)∑k=1Kexp(δn,k).
:MATH]
(Equation 1)
For each read
[MATH:
i(m)∈{1,…,Dn(m)
msubsup>} :MATH]
from the
[MATH:
mth
:MATH]
modality
[MATH:
wn,i(m)(m) :MATH]
, draw a feature index
[MATH:
v(m) :MATH]
(e.g., the particular transcript or open chromatin region the read was
sequenced) from a categorical distribution
[MATH: Cat(rn,.(m)) :MATH]
:
[MATH:
wn,i(m)(m)∼∏v(m)=1V(
m)[rn,
v(m)][wn,i(m)(m)=v(m)] :MATH]
(Equation 2)
where
[MATH:
Dn(m)
:MATH]
denotes the total number of reads. The expected rate
[MATH:
rn,v(m)(m) :MATH]
of observing feature
[MATH:
v(m) :MATH]
in cell n is parameterized as:
[MATH:
rn,v(m)(m)=exp(rˆn,v(m)<
mrow>(m))∑v(m)=1V(
m)exp(rˆn,v(m)<
mrow>(m)),rˆn,v(m)<
mrow>(m)=θnαρ.,v(m)(m)+λs(n),v
(m)(m) :MATH]
(Equation 3)
where
[MATH: ρ.,v(m)(m)∈RL×1<
/mrow> :MATH]
denotes embedding of feature
[MATH:
v(m) :MATH]
,
[MATH:
λs(n
),v(m)<
/mo>(m) :MATH]
is the batch-dependent and feature-specific scalar effect, where
[MATH:
s(n)
:MATH]
indicates the batch index for the
[MATH:
nth
:MATH]
cell. Notably, the softmax function normalizes the expected observation
rates over all features separately within each modality to account for
different modality size (e.g., there are more peaks than genes, and
more transcripts than surface proteins). Another reason for the
normalization is to capture feature sparsity (i.e., only a small
fraction of features from each modality is non-zero). This is analogous
to text mining, where only a small fraction of the unique words are
draw from the entire vocabulary for any given document.
The likelihood for cell n can be expressed as multinomial distribution:
[MATH: p(wn|rn)∝∏i=1Dn(m)∏v(m)=1V(m)<
msup>[rn,v(m)][wn,i(m)(
mo>m)=v(m)]=∏v(m)=1V(m)<
msup>[rn,v(m)]∑i=1Dn(m)<
mo stretchy="true">[wn,i(m)(
mo>m)=v(m)]=∏v(m)=1V(m)<
mrow>[rn,v(m)]xn,v(m) :MATH]
where
[MATH:
xn,v(m)=∑i=1Dn(m)
[wn,i(m)=v] :MATH]
denotes the read count for feature
[MATH:
v(m) :MATH]
for cell n in the
[MATH:
mth
:MATH]
modality. As a result, we can more conveniently express the data
likelihood in terms of the read count:
[MATH: p(xn|rn)=∏
v(m)=1V(
m)[rn,
v(m)]xn,
mo>v(m) :MATH]
(Equation 4)
moETM model inference
For the ease of inference, we consider the cell topic embedding
[MATH: δn :MATH]
(before softmax normalization) for cells
[MATH: n∈{1,
…N} :MATH]
as the latent variables and all the cells are independent. The rest of
the parameters including topic embedding
[MATH: α :MATH]
, feature embedding
[MATH: {ρ(m)}m=1M
:MATH]
, and batch-effect parameter
[MATH: {λ(m)}m=1M
:MATH]
are treated as point estimates and learned by the model. Let’s denote
[MATH: Θˆ={δn,α,{ρ(m)}m=1M,
mo>{λ(m)}m=1M
)} :MATH]
. A principled way to learn those parameters is to maximize the log
marginal likelihood:
[MATH: Θ∗←argmaxΘ∑nlogp({xn(m)}m=1M∣Θ∗)≡argmaxΘ∑nLn :MATH]
However, this integral is not tractable. Instead, we took a variational
inference approach to optimize the model parameters by maximizing an
evidence lower bound (ELBO) of the marginal log likelihood with a
proposed variational posterior
[MATH: q(δn) :MATH]
as a surrogate to the true posterior of the cell topic embedding
[MATH: p(δn∣{xn(m)
}m=1M
) :MATH]
:
[MATH: Ln=log∫p({xn(m)
}m=1M,
mo>δn∣Θ∗)dδn :MATH]
[MATH: =log∫q(δn)p({xn(m)
}m=1M,
mo>δn∣Θ∗)q(δn)dδn :MATH]
[MATH: =logEq(δn)[p({xn(m)
}m=1M,
mo>δn∣Θ∗)q(δn)] :MATH]
[MATH: ≥Eq(δn)[logp({xn(m)
}m=1M,
mo>δn∣Θ∗)q(δn)] :MATH]
(Equation 5)
[MATH: =Eq(δn)[logp({xn(m)
}m=1M∣
mo>δn,Θ∗)p(δn)−logq(δn)] :MATH]
[MATH: =Eq(δn)[logp({xn(m)
}m=1M∣
mo>δn,Θ∗)]−Eq(δn)[logq(δn)p(δn)] :MATH]
[MATH: =Eq(δn)[logp({xn(m)
}m=1M∣
mo>δn,Θ∗)]−KL[q(δn)‖p(δn)]≡ELBO
mi>n :MATH]
(Equation 6)
where [279]Equation 5 follows the Jensen’s inequality[280]^69 and KL
denotes the Kullback-Leibler (KL) divergence between the proposed
distribution and the prior (i.e., standard Gaussian with zero mean and
identity variance), acting as a regularization when maximizing the data
likelihood.
We defined the proposed distribution
[MATH: q(δn) :MATH]
as a product of Gaussians (PoGs):
[MATH: q(δn)=N(δn;μ∗,σ∗2), :MATH]
(Equation 7)
The mean
[MATH: μ :MATH]
and standard deviation
[MATH: σ :MATH]
of the joint Gaussian is computed as:
[MATH: μ∗=∑m<
/mi>=1M<
mi mathvariant="bold">μmσm21+∑m=1Mσm2<
mo>,σ∗2=∏m
=1Mσm21+∑m=1Mσm2<
/mrow> :MATH]
(Equation 8)
where
[MATH: μm :MATH]
and
[MATH: σm2 :MATH]
are the mean and variance of the Gaussian latent embedding for the
individual modalities, respectively. Those are output from the encoder
neural network (NNET):
[MATH: [μn(m)
;logσn(m)
]=NNET(x˜n(m);W) :MATH]
(Equation 9)
where
[MATH: x˜n(m) :MATH]
is the normalized counts for each feature as the raw count of the
feature divided by the total counts of
[MATH:
mth
:MATH]
modality in cell n, and
[MATH: W :MATH]
is the parameters for a two-layer feedforward neural network.
We approximate the above ELBO in [281]Equation 6 by sampling from the
proposed joint Gaussian distribution using the reparameterization
trick[282]^13:
[MATH: δ˜n∼N(μ∗,diag
(σ∗))=μ∗+diag(σ∗)N(0,I) :MATH]
[MATH:
ELBOn
(m)≈logp({xn(m)
}m=1M∣
mo>δ˜n,<
mi mathvariant="normal">Θ)−KL[q(δ˜n)‖p(δ˜n)] :MATH]
where the KL divergence has closed form:
[MATH: KL[q(δ˜n)‖p(δ˜n)]=Eq(δ˜n)[logq(δ˜n)]−Eq(δ˜n)[logp(δ˜n)]=[−12log{σ∗}2−12−12log
mi>(2π)
mrow>]−[−12log(2π)−12{μ∗}2−12{σ∗}2]=−12[log{σ∗}2−{μ∗}2−{σ∗}2+1] :MATH]
(Equation 10)
Together, with the Multinomial likelihood defined in [283]4 and KL
divergence in [284]10, we can express the ELBO in its approximate
closed-form using the sampled latent variable:
[MATH: ELBO≈∑n=1N∑m=1Mlogp({xn(m)}m=1M∣δ˜n,Θ)−KL[q(δ˜n)‖p(δ˜n)]=∑m=1M∑v(m)=1V(m)<
msubsup>xn,v(m)log(rn,v(m))−12[log{σ∗}2−{μ∗}2−{σ∗}2+1] :MATH]
(Equation 11)
where
[MATH:
rn,v(m) :MATH]
is defined in [285]3. The model parameters including the encoder weight
[MATH: W :MATH]
and the decoder weights
[MATH: Θ={α,ρ} :MATH]
are optimized by maximizing the above ELBO via backpropagation:
[MATH: Θ∗,W∗←argmaxΘ,W∑n=1Nlogp({xn(m)
}m=1M∣
mo>δ˜n,<
mi mathvariant="normal">Θ)−KL[q(δ˜n)‖p(δ˜n)] :MATH]
(Equation 12)
Single-cell multi-omic datasets and preprocessing
There were 7 public datasets included in this study for performance
evaluation and model comparison. All 7 datasets are from publicly
available repositories. Among them, 4 datasets provide joint profiling
of gene expression and open chromatin regions (denoted as “gene+peak”
data).
* 1.
Multiome bone marrow mononuclear cells (BMMC1) dataset from the
2021 NeurIPS challenge consisting of 42,492 cells with 22 cell
types from 10 donors across 4 sites,[286]^70,
* 2.
SHARE-seq mouse skin late anagen (MSLAC) dataset containing 34,774
cells with 1 batch and 23 cell types,[287]^24,
* 3.
sci-CAR mouse kidney cells (MKC) dataset from cell samples with 1
batch and 14 cell types,[288]^71,
* 4.
SHARE-seq mouse brain cells (MBC) dataset containing 3,293 cells
with 1 batch and 19 cell types.[289]^24
For the BMMC1 dataset, we take into account two different batch types:
one treats a subject (e.g., site1 + donor1 as a subject s1d1, site1 +
donor2 as a subject s1d2, etc) as a batch (s1d1, s1d2, s1d3, s2d1,
s2d4, s2d5, s3d3, s3d6, s3d7, s3d10, s4d1, s4d8, s4d9, 13 batches in
total), while the other treats a site (site1 as batch1, site2 as
batch2) as a batch (4 batches in total). For the CITE-seq data
measuring transcriptome and surface protein in the same cell, 3
datasets were used in this study.
* 1.
Bone marrow mononuclear cells (BMMC2) dataset from the 2021 NeurIPS
challenge from 9 donors and 4 sites,[290]^70,
* 2.
Human White Blood Cell (HWBC) dataset containing 211,000 human
peripheral blood mononuclear cells,[291]^12,
* 3.
Human Blood Immune Cell (HBIC) dataset[292]^57 measuring 647,366
peripheral blood mononuclear cells from both COVID patients and
healthy patients.
Similarly, for the BMMC2 dataset, we consider two different batch
types: one treats donors as batches (12 batches in total), while the
other treats sites as batches (4 batches in total). All datasets were
processed into the format of samples-by-features matrices. For
gene+peak datasets, the read count for each gene and peak were first
normalized per cell by total counts within the same omic using
scanpy.pp.normalize_total function in the scanpy,[293]^72 then log1p
transformation was applied. After that, scanpy.pp.highly_variable_genes
was used to select highly variable genes or peaks. For the joint
profiling of transcriptome and surface protein data (denoted as
gene+protein), we used all surface proteins measured by the scADT-seq
assay since the number of proteins is much smaller compared with the
number of genes or peaks and all of them are highly informative of
immune cell functions. The same normalization as in the gene+peak data
was performed on the gene+protein data.
Cross-omic imputation
The trained moETM can impute one omic from another omic. Suppose we
have two omics namely omic A and omic B. For the training data where
both omics are observed, moETM learns a shared topic embedding
[MATH: α :MATH]
and omic-specific feature embedding
[MATH: ρ(A) :MATH]
and
[MATH: ρ(B) :MATH]
. For the testing data, suppose without loss of generality that only
omic B is observed. To impute omic A, moETM uses the encoder for
modality B to generate the topic mixture, which is then input to the
decoder for omic A to complete the imputation ([294]Figure 1C). We
evaluated the imputation accuracy using the BMMC1 (gene+peak) and BMMC2
(gene+protein) datasets based on (1) 60/40 random split of training and
testing data with 500 repeats to get standard deviation estimate; (2)
training on all batches except for one batch and testing on the
held-out batch (leave-one-batch); (3) training on all cell types except
for one cell type and testing on the held-out cells of that cell type
(leave-one-cell-type).
Evaluation metrics
The batch effects correction and biological variance conservation
categories were used to assess the efficacy of the integration across
multiple modalities. To quantify bio-conservation, we used the Adjusted
Rand Index (ARI) and Normalized Mutual Information (NMI), and to
measure batch effect removal, we used k-nearest-neighbor batch-effect
test (kBET) and Graph Connectivity (GC). Specifically, ARI calculates
the degree of similarity between two clusterings and adjusts for the
possibility that objects can randomly form the same clusters. NMI
normalizes the mutual information to a scale of 0–1. While NMI excels
in unbalanced clustering or small clusters, ARI is better suited to
clusters of similar size.[295]^73 kBET performs hypothesis testing on
whether batch labels are distributed differently across cells based on
Pearson’s
[MATH: χ2 :MATH]
test.[296]^19 GC measures whether cells of the same type from different
batches are close to one another by computing a K nearest-neighbour
graph based on the distance between cells in the embedding
space.[297]^20
Quantification and statistical analysis
Linking genes to open chromatin regions
We sought to investigate the relation between the top peaks and top
genes under the same moETM topic (i.e.,
[MATH: φk(m)
=αkρ(m) :MATH]
for topic k and
[MATH: m∈{ge
ne,peak<
/mrow>} :MATH]
). To assess the in-cis relation, we measured the genomic distances
between genes and peaks and designated genes that were near peaks as
peak-neighboring-genes if they are within 150K base pairs (bp)
distance. Specifically, we first obtained a genes-by-topics matrix
[MATH: φ(gene)<
/mrow>=αρ(gene)<
/mrow> :MATH]
and a peaks-by-topics matrix
[MATH: φ(peak)<
/mrow>=αρ(peak)<
/mrow> :MATH]
.
To transform
[MATH: φ(peak)<
/mrow> :MATH]
into a peak_to_genes-by-topics matrix
[MATH: φ(peaks
_to_genes)<
/mrow> :MATH]
, we first derived a binary peaks-to-genes mapping matrix
[MATH: H :MATH]
with the entries
[MATH:
hp,g=1 :MATH]
if the corresponding pair of peak p and gene g are within 150K bp
genomic distance and are positively correlated and 0 otherwise. In
detail, we computed the Pearson correlation between gene g and peak p
in terms of their topic scores:
[MATH:
rp,g=(φg(gene<
mo>)−φ¯g(gene))⊤(φp(peak<
mo>)−φ¯p(peak))‖φg(gene<
mo>)−φ¯g(gene)‖2<
/msub>‖φp(peak<
mo>)−φ¯p(peak))‖<
mn>2 :MATH]
The genome distance between peaks and genes was based on the latest
genome build (i.e., hg38 for human) and obtained via the
GenomicRanges[298]^74 package in R.
Pathway enrichment analysis
For each moETM topic, we performed Gene Set Enrichment Analysis
(GSEA)[299]^37 to associate the topic with known pathways or gene sets.
In particular, we used each topic to query two gene sets from Molecular
signatures database (MSigDB), which are the 5219 Immunologic signature
gene sets (C7) and the 7763 Gene Ontology Biological Processes (BP)
(C5-BP) terms. For each topic, we ran GSEAPreranked on a ranked list of
genes based on their corresponding topic scores against every gene set
from C7 or C5-BP, and calculated the enrichment score (ES) for over- or
under-representation. The statistical significance of the ES was
computed based on 1000 permutation test. The gene sets with
Benjamini–Hochberg (BH) corrected p values lower than 0.05 were deemed
significant. Similarly, for the scATAC-seq data, the peaks-by-topics
matrix was first converted into a peaks_to_genes-by-topics matrix and
then provide as input to GSEA pipeline.
Motif enrichment analysis of top peaks from moETM-learned topics
To detect sequence-based regulatory elements for the cell-type-specific
topics, we performed motif enrichment analysis using the top 100 peaks
that exhibit the highest topic scores under each topic. The 100
sequences corresponding to those top 100 peaks under each topic were
extracted from Ensembl database and provided as input to the Simple
Enrichment Analysis (SEA) pipeline[300]^75 from the MEME suite.[301]^76
SEA utilizes the STREME motif discovery algorithm[302]^77 to identify
known motifs that are enriched in input sequences. For our purpose, we
used the HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) Human
(v11) and HOCOMOCO Mouse (v11) motif database.[303]^78 Motifs with
Fisher’s exact test p values lower than 0.05 were selected as the
enriched motifs.
Differential analysis to detect condition-specific topics
We sought to detect moETM-topics that exhibit significantly higher
scores for the conditions of interest such as cell types or phenotypes.
Notably, while the cell types were at the single-cell level, the
phenotypes were at the subject level (e.g., COVID-19 severity state).
The latter means that the cells from the same subject were assigned the
same phenotype label. For each dataset, we first split the cells into
positive and negative groups, corresponding to the presence and absence
of the target condition, respectively. For each topic, we assessed the
statistical significance of the topic score increase for the positive
group relative to the negative group based on one-sided Student’s t
test. The topics with a Bonferroni-adjusted p value smaller than 0.001
were considered significant with the label.
Incorporating pathway-informed gene embeddings
In the linear decoder, we reconstruct the cells-by-features matrix by
the dot product of the 3 matrices, namely cells-by-topics,
topics-by-embedding, and embedding-by-features. By default, the last
feature embedding matrix consist of learnable parameters. However, we
can instill prior pathway information during the training of moETM by
fixing the features embedding to a known gene set. As a result, the
topics-by-embedding and embedding-by-features matrices change to
topics-by-gene_sets and gene_sets-by-features with only the
topics-by-gene_sets as the learnable parameters. This allows us to
directly map each topic to each gene set, which may further improve the
model interpretability especially if the chosen gene sets were highly
relevant to the data. Given that several single-cell multi-omic
datasets used in this study were derived from the blood, we utilized
the Immunologic signature gene sets collection (C7) from the MSigDB
database. Gene sets with fewer than five or more than 1000 genes were
filtered out. We then converted the gene set information into a binary
gene_sets-by-genes matrix with 0 and 1 indicating the absence and
presence of the genes (columns) in the corresponding gene set (rows),
respectively. We focused on the gene+peak case by fixing the gene
embedding to the gene set while learning the peak embedding as in the
default setting. We did not experiment this approach on the
gene+protein case, for which the topics learned by the default moETM
are sufficiently easy to interpret.
Acknowledgments