Abstract
Single-cell and single-nucleus RNA-sequencing (sxRNA-seq) is
increasingly being used to characterise the transcriptomic state of
cell types at homeostasis, during development and in disease. However,
this is a challenging task, as biological effects can be masked by
technical variation. Here, we present JOINTLY, an algorithm enabling
joint clustering of sxRNA-seq datasets across batches. JOINTLY performs
on par or better than state-of-the-art batch integration methods in
clustering tasks and outperforms other intrinsically interpretable
methods. We demonstrate that JOINTLY is robust against over-correction
while retaining subtle cell state differences between biological
conditions and highlight how the interpretation of JOINTLY can be used
to annotate cell types and identify active signalling programs across
cell types and pseudo-time. Finally, we use JOINTLY to construct a
reference atlas of white adipose tissue (WATLAS), an expandable and
comprehensive community resource, in which we describe four adipocyte
subpopulations and map compositional changes in obesity and between
depots.
Subject terms: Data integration, Statistical methods, Software,
Functional clustering, Transcriptomics
__________________________________________________________________
Batch integration is a critical yet challenging step in many
single-cell RNA-seq analysis workflows. Here, authors present JOINTLY,
a hybrid linear and non-linear NMF-based algorithm, providing
interpretable and robust cell clustering against over-integration.
Introduction
Single-cell and single-nucleus RNA-sequencing (sxRNA-seq) has immense
potential to enhance our understanding of human biology during
homeostasis and how development or diseases shape our cells, tissues,
and organs. To relate gene expression programs in specific cell types
or cell states to a disease or developmental state, it is required that
batch effects, which are technical sources of variation, between
samples are removed, as they can otherwise introduce false or mask true
associations.
In recent years, several methods have attempted to overcome the problem
of batch effects by integrating sxRNA-seq datasets using graph-based
(e.g., fastMNN^[26]1), statistical (e.g., Harmony^[27]2), or deep
learning-based (e.g., scVI^[28]3) approaches. However, it remains a
challenging problem. A recent comprehensive benchmark of batch
integration methods by the Theis group found that each method has a
different balance between conserving biological variation and removing
batch effects and that this balance can depend on the integration
task^[29]4. In addition to conserving different amounts of biological
variation, each method also has different and task-specific sensitivity
to over-correction, where biological variation, rather than batch
effects, are removed^[30]5,[31]6. Interpretable batch integration
methods can aid in evaluating the integration performance in a specific
task by enabling the user to evaluate whether the genes or gene
modules, driving integration are meaningful in the biological context.
There are already interpretable batch integration methods, such as
LIGER^[32]7, which is based on non-negative matrix factorisation-based
(NMF). LIGER learns shared (or dataset-specific) factors that in linear
combination can describe each cell, and these factors are interpretable
as they are defined by non-negative linear combinations of genes. Thus,
the factors, that can be used for clustering, also represent weighted
gene modules that allow the user to evaluate the integration and can
assist in functional annotation of datasets. However, in benchmarks,
such linear methods are not efficient at removing batch effects^[33]4
likely because batch effects can be highly non-linear^[34]8.
Here, we introduce JOINTLY, a hybrid linear and non-linear NMF-based
joint clustering tool. We benchmark JOINTLY and eight other batch
integration methods in five different integration tasks composed of a
total of 52 datasets. JOINTLY achieves state-of-the-art performance and
additionally generates interpretable factors. We show that the genes
associated with the interpretable factors are more specific for cell
types than traditional marker genes and that they can be used to guide
cell type annotation and discover active biological processes across
cell types and pseudo-time. We evaluate the robustness against
over-correction by integrating multi-donor datasets from different
tissues and find that JOINTLY removes within-tissue batch effects but
retains across-tissue biological variability. Finally, we demonstrate
how JOINTLY can be used to create a tissue atlas by clustering and
labelling cell types and states in white adipose tissue from six
different studies. Based on these high-quality labels, we create a
reference atlas of white adipose tissue (WATLAS) deeply characterising
the transcriptome of 43 cell types and states. The WATLAS is a
community resource, which is a source for hypothesis generation, for
contextualising new datasets through co-embedding and cell type and
state annotation using transfer learning, as well as for use as a
reference for deconvolution. Analysis of WATLAS revealed compositional
differences between lean and obese donors and between different white
adipose tissue depots, which we support by deconvoluting bulk
RNA-sequencing samples from approximately 1300 additional donors.
Results
JOINTLY identifies joint clusters and shared gene modules
There are at least two major sources of variation in multi-sample
single-cell and single-nucleus RNA-seq (sxRNA-seq), namely biological
signals and batch effects. The objective of our method, named JOINTLY,
is to exclusively capture the biological signals for cell clustering
(Fig. [35]1A). To achieve this, JOINTLY employs a hybrid framework of
linear and non-linear non-negative matrix factorisation. This framework
optimises three distinct low-rank matrices through multiplicative
updating. The non-linear component of JOINTLY decomposes the gene
expression matrix using consensus PCA (see “Methods”) into a reduced
dimensional space that effectively captures the shared variance between
batches, while also accounting for residual dataset-specific variance.
For each dataset, this reduced dimensional space is used to estimate
local cell–cell distances using an adaptive heat-based kernel^[36]9.
This kernel is factorised using kernel non-negative matrix
factorisation^[37]10 into two low-rank matrices (factor by cell and
cell by factor) regularised by shared nearest neighbours graphs. The
third and final low-rank matrix is a linear feature matrix (factor by
gene), which is minimised to generalise across all batches. This
ensures that learned factors, which describe the gene expression space
in each dataset, are characterised by the same genes across batches
facilitating both joint clustering and interpretation.
Fig. 1. JOINTLY clusters single-cell RNA-seq and single-nucleus RNA-seq
(sxRNA-seq) datasets without explicit integration.
[38]Fig. 1
[39]Open in a new tab
A Schematic illustration of JOINTLY. JOINTLY uses a hybrid linear and
non-linear non-negative matrix factorisation to optimise reduced
dimensional spaces, which allows for clustering across datasets and
interpretation to discover conserved and active biological processes.
The gene expression matrix (X) for each dataset (d in D) is decomposed
into lower rank matrices (H and F) in a mapped higher or infinitely
dimensional space (
[MATH: Φ :MATH]
). Graph regularisation is applied using the graph Laplacian L.
Interpretable factors V constrain JOINTLY to patterns that are
generalisable across datasets.
[MATH: α :MATH]
,
[MATH: β :MATH]
and
[MATH: λ :MATH]
weigh components of the loss function. Subject to (s.t.) non-negativity
constraints on the F and H matrix. Tr is the trace matrix. I is the
identity matrix. B, C UMAP for 8 cell lines^[40]11 with simulated batch
effects (see “Methods”) based on reduced dimensional spaces calculated
using PCA or JOINTLY. The UMAPs are coloured by batch, cell line, and
cluster (B) or by selected gene modules derived from the interpretation
JOINTLY (C). Source data are provided as a Source Data file.
To evaluate the ability of JOINTLY to cluster scRNA-seq datasets with
batch effects, we initially analysed a simple dataset consisting of
eight purified cell lines^[41]11. We randomly split the dataset into
two batches and simulated non-linear and cell type-dependent batch
effects (see “Methods”). The clusters identified using JOINTLY recover
the original cell types, whereas clustering on the unintegrated dataset
wrongly assigned one batch of Ramos cells to the Jurkat cluster
(Fig. [42]1B). In addition, JOINTLY also identifies interpretable
factors that represent gene modules driving clustering. These modules
are highly cell line-specific containing genes relevant to cellular
function (Fig. [43]1C, Supplementary Fig. [44]S1A). For example,
Decorin (DCN), a fibroblast marker gene, is contributing to the module
highly specific for fibroblast-derived IMR90 cells, while Cluster of
Differentiation 3B (CD3B), a T-cell coreceptor, is contributing to the
module highly specific for T-cell derived Jurkat cells. The gene
modules are generally not strongly correlated to each other within a
single batch, but highly correlated across batches (Supplementary
Fig. [45]S1B) suggesting that the modules represent different but
reproducible gene programs. To assess how these modules are affected by
the presence of batch effects, we analysed the same datasets without
the simulated batch effects and compared the detected gene modules. We
found that the modules in the datasets with and without simulated batch
effects are highly conserved (Supplementary Fig. [46]S1C) indicating
that JOINTLY discovers reproducible gene programs independent of batch
effects.
JOINTLY achieves state-of-the-art clustering performance
To compare the joint clustering performance of JOINTLY to existing
methods, we applied JOINTLY and eight state-of-the-art integration
methods^[47]1–[48]3,[49]7,[50]12–[51]14 to a sxRNA-seq dataset from the
human lung containing ~10,000 cells distributed over six batches^[52]4.
We reannotated the dataset automatically using a similar public
dataset^[53]15 and support vector classification (see “Methods”), which
in recent tests has been shown to have very good performance for label
transfer^[54]16,[55]17. We chose to reannotate the data, rather than
using labels defined by the original authors, to avoid any potential
biases in favour of the clustering and batch integration methods used
in the original publication. In this dataset, several methods,
including JOINTLY, reach an adjusted rand index (ARI) in the range of
0.89−0.91 (Fig. [56]2A). Notably, among the tested interpretable
methods, only JOINTLY reaches this state-of-the-art performance level.
No method achieves perfect performance since they all fail to separate
endothelial subtypes (blood vessel, vein, and lung microvascular), as
well as myeloid cell types (classical monocytes, macrophages, and
dendritic cells), but do correctly group them.
Fig. 2. JOINTLY performs on par with competing batch integration and
clustering methods.
[57]Fig. 2
[58]Open in a new tab
A UMAP based on embedded spaces calculated by the indicated methods and
coloured by transferred cell type labels and clusters in the Lung
dataset, as indicated in the figure. Heatmap showing ranks aggregated
across all datasets (global) and for the worst dataset (worst) based on
the adjusted rand index (ARI) between clusters and cell types (B) as
well as ranks based on batch and cell type separation and mixing
evaluated using cell type local inverse Simpson’s index (cLISI), cell
type average silhouette width (cASW), integration LISI (iLISI) and
batch ASW (bASW) (C) for the indicated methods and datasets. Line
charts showing ARI (D), iLiSI (E), bASW (F), cLISI (G), and cASW (H)
for the full dataset (zero datasets removed), and the average across
all possible subsamples removing between one and three datasets from
the Lung dataset where
[MATH: n0
:MATH]
= 5,
[MATH: n1
:MATH]
= 5,
[MATH: n2
:MATH]
= 10 and
[MATH: n3
:MATH]
= 10 unique sample combinations. Error bars show the standard error of
the mean. I Line chart showing the average fraction of conserved
neighbours. For the full dataset (zero datasets removed), the average
was calculated from the fraction of conserved neighbours between
different runs of JOINTLY on the full dataset and a reference run on
the full dataset where
[MATH: n0
:MATH]
= 5. For subsamples (between one and three datasets removed), the
average was calculated from the fraction of conserved neighbours
between the integration of the subsampled dataset and the reference run
on the full dataset, where
[MATH: n1
:MATH]
= 5,
[MATH: n2
:MATH]
= 10 and
[MATH: n3
:MATH]
= 10 unique sample combinations. Error bars show the standard error of
the mean. Source data are provided as a Source Data file.
Next, we expanded the benchmark by including an additional 4
integration tasks containing between 8444 and 46,993 cells distributed
over five to 18 batches^[59]4,[60]18–[61]21 with labels transferred
from similar datasets^[62]4,[63]20–[64]24. Summarised across all tasks,
JOINTLY ranks second only superseded by scVI, while the other
interpretable methods rank eighth and ninth, respectively (Fig. [65]2B,
Supplementary Fig. [66]S2A, see Source Data). In joint clustering, all
datasets may not be equally well clustered, or some datasets may
dominate the integration due to for example a difference in the number
of cells. To assess if there is a performance imbalance between
datasets, we evaluated the clustering performance in each dataset after
joint clustering. Together with scVI and Scanorama, JOINTLY ranks first
in the worst dataset of each task (Fig. [67]2B) showing that JOINLY has
a highly balanced performance across datasets. Taken together, this
demonstrates that JONTLY effectively captures latent gene expression
patterns, which are related to cellular identity, and generalises well
across all datasets.
In addition to clustering performance, we also evaluated batch and cell
type mixing using local inverse Simpson’s index^[68]2 (LISI) and the
average silhouette width (ASW) (Fig. [69]2C). The cell type LISI
(cLISI) assesses cell type mixing; the number of closest neighbours for
each cell that are of a different cell type. All methods perform
approximately equally well, including naïve unintegrated analysis. The
cell type ASW (cASW) assesses how separated cell types are; the average
distance to another cell of the same cell type compared to the average
distance to another cell of a different type. JOINTLY ranks fifth
followed by scVI and Scanorama. The integration LISI (iLISI) assesses
batch mixing; the number of closest neighbours for each cell that is of
a different batch. JOINTLY ranks seventh after scVI. Finally, batch ASW
(bASW) assesses how separated batches are; the average distance to
another cell of the same cell type in the same batch compared to the
average distance to another cell of the same cell type in a different
batch. For this metric, JOINTLY ranks first followed by LIGER and then
scVI, scGPT, and Scanorama. In summary, JOINTLY achieves
state-of-the-art joint clustering performance and has similar
trade-offs as scVI and Scanorama in terms of cell type and batch
separation and mixing.
One of the differences between JOINTLY and other methods is that
JOINTLY, by default, uses consensus PCA (see “Methods”) as a basis for
integrating samples. Therefore, the initial decomposition of each
sample depends on the other samples. Thus, integration performance is
dependent on the number and similarity of input samples. To evaluate
the extent of this dependency, we removed all sample combinations from
the human liver dataset resulting in 25 subsamples with two to four
batches. We found no difference in the ARI, in the average cell type
and integration LISI nor the cell type and batch ASW for subsamples
with two to four batches compared to the full dataset with five batches
(Fig. [70]2D–H, Supplementary Fig. [71]2B), although we did find that
with fewer batches, the standard error of the mean increases,
indicating that integration performance becomes more variable. We also
evaluated how well the nearest neighbours are conserved between each
subsample and the full dataset and found that approximately 50–60% of
the nearest neighbours are conserved with a minor decrease with fewer
batches, which is similar in range to the variation between different
integrations on the full dataset (Fig. [72]2I). Collectively, this
indicates that the performance of JOINTLY is not strongly dependent on
the number of input datasets.
JOINTLY retains biological variation across conditions
An important use case for joint clustering is multi-sample,
multi-condition datasets, such as studies comparing several healthy and
diseased individuals. In such a dataset, the aim is to make inferences
on the level of conditions, while controlling for sample-to-sample
variation. It is especially challenging to perform joint clustering in
this type of dataset since both batch effects and biological
variability between conditions contribute to intra-sample variability.
Thus, we cannot assume a priori that all samples should overlap. To
evaluate the ability of JOINTLY and alternative methods to retain
biological variability between conditions, while removing batch effects
between samples, we created 10 multi-sample, multi-condition datasets
by pairwise combination of the 5 datasets used for benchmarking. For
each integration method, we evaluated the integration and cell type
LISI for each tissue, as well as the integration LISI across tissues.
We found JOINTLY consistently achieves a high overall ranking
(Fig. [73]3A, see Source Data). Generally, JOINTLY balances the
preservation of biological variation, while removing batch effects
within each tissue, with obtaining a good separation across tissues. As
an example, we analysed the mixture of Pancreas and Lung in more detail
and found upon visualising the data, that all methods, except LIGER and
Seurat, successfully keep the majority of the two datasets unmixed. Of
the remaining methods, JOINTLY, Scanorama, Harmony, and scVI obtain
good integration of the samples, while separating within-tissue cell
types (Fig. [74]3B and Supplementary Fig. [75]S3A, B). Among these four
methods, we found subtle differences. For example, Harmony mixes the
endothelial cells across tissues, while JOINTLY, scVI, and Scanorama
place them close to each other in embedded space but retain
tissue-specific clusters. This prompted us to investigate the
similarity of endothelial cells between tissues. Initially, we
identified shared marker genes and found that the shared marker genes
contain several known endothelial markers, such as lymphatic vessel
endothelial hyaluronan receptor 1 (LYVE1) and melanoma cell adhesion
molecule (MCAM) (Fig. [76]3C) highlighting that both populations are
bona fide endothelial cells. However, differential expression analysis
between the two populations revealed a large set of 3529 differentially
expressed genes that are similarly expressed across samples within each
tissue (Fig. [77]3D). We submitted lung- and pancreas-specific
endothelial markers to enrichment analysis using a database of
transcription factor targets and found an almost complete dichotomy
(Fig. [78]3E) suggesting that distinct transcriptional programs and
regulatory mechanisms are shaping endothelial cells from the two
tissues. As an example of different biological processes shaping the
cells, we found that interferon signalling is specifically high in the
lung-derived endothelial cells (Fig. [79]3F). Taken together, this
strongly suggests that lung-derived and pancreas-derived endothelial
cells represent different states of the same cell type. JOINTLY
correctly retains this variability indicating that JOINTLY is robust
against deletion of condition-specific variation in multi-sample,
multi-condition datasets.
Fig. 3. JOINTLY is robust to cell state deletion.
[80]Fig. 3
[81]Open in a new tab
A Heatmap showing ranked integration metrics (iLISI and cLISI within
each tissue and over-correction, which is iLISI between tissues) for
mixtures of the indicated tissues. B UMAP based on embedded spaces
calculated by the indicated methods and coloured by transferred cell
type label (see “Methods”) for the combined Lung and Pancreas dataset.
C Heatmap showing normalised and scaled per-donor pseudo-bulk
expression of marker genes shared between pancreas-derived and
lung-derived endothelial cells. D Heatmap showing normalised and scaled
per-donor pseudo-bulk expression of differentially expressed genes
between pancreas-derived and lung-derived endothelial cells. E
Scatterplot showing the -log[10] Benjamini-Hochberg-corrected p-values
of Enrichr fisher’s exact test of the enrichment of transcription
factor target gene among differentially expressed genes between
pancreas-derived and lung-derived endothelial cells. F Heatmap showing
normalised and scaled per-donor pseudo-bulk expression of interferon
signalling genes in pancreas-derived and lung-derived endothelial
cells. Source data are provided as a Source Data file.
Interpretable factors augment cell type annotation and uncover active
biological processes
Both JOINTLY and scVI achieve state-of-the-art performance in joint
clustering and are both robust against over-correction. However, one
defining difference between the two approaches is that JOINTLY is
intrinsically interpretable. To evaluate the value of JOINTLY’s
interpretable factors, we reanalysed a dataset from human white adipose
tissue subset to whole tissue single-nucleus RNA-seq on visceral
adipose tissue from nine donors^[82]25 using JOINTLY to integrate
across donors and different visceral adipose tissue sources
(Supplementary Fig. [83]4A, B). We identified 13 unique cell types
(Fig. [84]4A), each of which is enriched for known marker genes
(Fig. [85]4B) and overlaps the cell types identified by the original
authors (Supplementary Fig. [86]4C). We scored the gene modules
identified by JOINTLY in each cell and found that the modules are
enriched in one cell type or a group of related types (Fig. [87]4C, see
Source Data). Inspecting the modules, we found that they contain many
genes associated with cellular identity. For example, factor 13, which
is specific to adipocytes, contains genes such as cluster of
differentiation 36 (CD36), diacylglycerol O-acyltransferase 2 (DGAT2),
lipoprotein lipase (LPL), perilipin 5 (PLIN5), adiponectin (ADIPOQ),
acyl-CoA synthetase long chain 1 (ACSL1), and peroxisome
proliferator-activated receptor γ (PPARG), and factor 7, which is
specific to lymphatic endothelial cells (LEC), contain genes such as
LYVE1, reelin (RELN), multimerin 1 (MMRN1), and neuropilin 2 (NRP2). To
test if the modules and associated genes can be used for cell type
annotation, we performed gene set enrichment analysis using the modules
and databases of cell type-specific marker genes and found that for
several cell types, the correct label was identified in the top three
best hits per factor (Fig. [88]4C, right annotation, see Source Data).
Fig. 4. Interpretable factors uncover active biological processes.
[89]Fig. 4
[90]Open in a new tab
A UMAP based on JOINTLY embeddings of 73,118 adipose tissue cells from
9 batches coloured by cell type annotations^[91]25. Cells were
annotated based on marker gene expression. B Dot plot showing
expression levels and frequency for selected marker genes for each
identified cell type. C Heatmap showing average module scores using
gene modules derived from JOINTLY factors across the identified cell
types. Annotations on the right show the top 3 most enriched cell types
identified by pathway analysis. Correct matches between enrichment
results and manual curation are highlighted in bold. D Boxplots showing
area under the curve (AUC) for genes identified through interpretation
of JOINTLY or through differential gene expression analysis between
cell types. The centre represents the median AUC across the 13 cell
types and whiskers indicate 1.5 times the interquartile range above or
below the 75% and 25% quantiles, respectively. E Boxplot showing the
fraction of marker genes that are identified as marker genes in at
least two batches stratified by whether the marker gene is also
identified through interpretation of JOINTLY. The centre represents the
median fraction across the 13 cell types and whiskers indicate 1.5
times the interquartile range above or below the 75% and 25% quantiles,
respectively. F UMAP based on JOINTLY embeddings of 73,118 adipose
tissue cells from 9 batches coloured by module score of genes assigned
to JOINTLY factor 10. G Barplot showing the -log[10] FDR-corrected
p-values from Enrichr fisher’s exact test of selected pathways that are
enriched for genes assigned to JOINTLY factor 10. H Boxplot showing
module scores for genes assigned to JOINTLY factor 10 across adipocytes
and FAP stratified by donor sex. The centre represents the mean and
whiskers indicate 1.5 times the interquartile range above or below the
75% and 25% quantiles, respectively. I UMAP based on JOINTLY embeddings
of cells undergoing erythropoiesis during mouse gastrulation^[92]28
from 3 batches coloured by cell type annotations. J Heatmap showing the
average module score for all JOINTLY factors across cell bins. Cells
were binned in 50 groups based on pseudo-time (calculated using
scVelo^[93]27). Modules were grouped based on their temporal profile. K
Barplots showing the combined score of selected pathways identified by
pathway enrichment analysis of groups of modules (see J). Source data
are provided as a Source Data file.
We compared the modules identified by JOINTLY to the list of marker
genes for each cell type and found a high overlap between modules and
marker genes (Supplementary Fig. [94]4D). However, we found that the
marker genes, which are also module genes, have a significantly higher
area under the curve (AUC) than marker genes, which are not in the most
enriched module (Fig. [95]4D), indicating that module genes are more
discriminatory between cell types than marker genes. Similarly, we
found that module genes, which are not marker genes, have a
significantly lower AUC than genes, which are neither marker nor module
genes (Fig. [96]4D) indicating these module genes are markers of other
cell types. Finally, we evaluated the consistency of markers across
batches by finding marker genes in each batch independently and found
that marker genes, which are also module genes, are more often markers
in multiple batches compared to marker genes, which are not module
genes (Fig. [97]4E). Collectively, this indicates that module genes are
highly discriminatory between cell types, and more so than marker
genes.
In addition to cell type-specific modules, we also observed shared gene
modules with non-uniform distribution within enriched cell types. For
example, factor 10, which scores high in a subset of adipocytes and
fibro-adipogenic progenitors (FAPs) (Fig. [98]4F). We hypothesised that
such factors may represent distinct transcriptional states or programs.
Pathway analysis revealed that factor 10 contains genes involved in
insulin signalling and sensitivity (Fig. [99]4G) suggesting that this
factor marks cells with high insulin signalling capacity. Stratifying
adipocytes and FAPs by sex revealed that factor 10 scores high
specifically in females indicating that females may have higher adipose
tissue insulin sensitivity than men consistent with existing
literature^[100]26 (Fig. [101]4H).
In addition to discrete cell types, sxRNA-seq can also be used to probe
continuous biological processes, such as differentiation. To test
JOINTLY in this setting, we applied JOINTLY and scVelo^[102]27 to three
batches of cells undergoing erythropoiesis from an atlas of mouse
gastrulation^[103]28. We found that JOINTLY orders cells along cell
types and along latent time in a comparable manner to scVelo
(Fig. [104]4I, Supplementary Fig. [105]4E). Interpretation of JOINTLY
revealed that the JOINTLY factors have different temporal profiles
(Fig. [106]4J). Pathway analysis of genes associated with early,
temporary, early-mid, late-mid, or late latent time revealed signalling
cascades with different timing, consistent with existing literature on
erythropoiesis (Fig. [107]4K). Collectively, this suggests that the
interpretable factors learned by JOINTLY can be used to guide the
analyst toward annotating their datasets, recovering temporal dynamics,
and discovering new biological insights.
Building a white adipose tissue reference atlas using JOINTLY
Finally, having established that JOINTLY achieves high clustering
performance, we set out to generate a reference atlas for white adipose
tissue as a vignette of a use-case of JOINTLY to generate a community
resource. Integration of tissue atlases has different requirements than
integration of multiple samples in a single dataset. The datasets used
to construct tissue atlases often are more heterogeneous and have
stronger batch effects, as the datasets are often generated using a
multitude of technologies across several different laboratories.
Furthermore, a tissue atlas is often aimed towards being a resource for
a community, and it is therefore beneficial to create tissue atlases
using methods with transfer learning capabilities. This enables the
atlas to grow as new data is added, and it allows researchers to use
the atlas to contextualise their data without sharing raw data. Recent
benchmarking efforts have shown that semi-supervised integration using
scANVI is among the best-performing methods in this space^[108]4.
However, to apply semi-supervised integration and build a
state-of-the-art tissue atlas, cell type labels are required.
We uniformly annotated cells from white adipose tissue from six
independent sxRNA-seq studies^[109]15,[110]25,[111]29–[112]32 by
applying JOINTLY to each study separately (Fig. [113]5A). Subsequently,
we created an expandable tissue atlas, termed WATLAS, by integration
using scANVI^[114]33, totalling ~300,000 cells from both visceral (VAT)
and subcutaneous adipose tissue (SAT) depots. In the WATLAS, we were
able to recover 17 major cell types in WAT, and 43 total subtypes
(Fig. [115]5B, Supplementary Fig. [116]5A, B). Our labels for cell
types and subtypes are highly consistent with the original author
labels (Supplementary Fig. [117]5C) and all cell types and subtypes are
supported by robust gene signatures (Supplementary Fig. [118]S5D, see
Source Data). We have made the atlas, annotations, and metadata openly
and freely available^[119]34 as well as the model weights for transfer
learning^[120]35. To highlight the richness of this resource for
hypothesis generation, we investigated how obesity affects the
composition of the white adipose tissue. Controlling for sex and depot,
we found that obesity is associated with a trend towards a decrease in
the fractional number of smooth muscle cells and fibro-adipogenic
progenitors (FAPs), driven by CXCL14^+ and PPARG^+ FAPs, and a trend
towards an increase in LPL^+ and LYVE1^+ macrophages as well as all
subtypes of adipocytes, except PRSS23^+ adipocytes (Fig. [121]5C). An
increase in adipocytes and a decrease in FAPs could suggest an
increased rate of adipogenesis in obesity in humans, similar to what we
have previously reported in obese mice^[122]36.
Fig. 5. Building a white adipose tissue atlas with JOINTLY.
[123]Fig. 5
[124]Open in a new tab
A Illustration of the datasets and workflow used to generate the white
adipose tissue atlas (WATLAS). The colour of the characters indicates
the donor sex (green = male, red = female), visceral adipose tissue
(VAT) and subcutaneous adipose tissue (SAT) indicates the depot origin,
stromal vascular fraction (SVF) and whole tissue (WT) indicate the cell
fractions used. * Indicate a fraction was enriched for. B UMAP based on
scANVI^[125]33 embeddings coloured by cell type labels. Cells were
annotated based on marker genes and clustering of each individual
dataset with JOINTLY. Subsequently, scANVI was trained using the
JOINTLY labels. C Scatterplot showing compositional changes in cell
type abundances between individuals with or without obesity, corrected
for depot and sex in 46 donors. Broad cell types in the left panel, and
states of highlighted cell types in the right panel. Centres represent
the mean effect size and error bars represent the 95% confidence
intervals. D Heatmap showing normalised and standardised average
expression of adipocyte markers in adipocyte subpopulations and all
other cell types stratified by the depot. E Heatmap showing normalised
and standardised average module scores in adipocyte subpopulations
stratified by depot and study. Modules were generated using
differentially expressed genes in the indicated adipocyte
subpopulations. F Heatmap showing normalised and scaled per-donor
pseudo-bulk expression levels of differentially expressed genes between
depots stratified by adipocyte subtypes, study, and depot. G
Scatterplot showing compositional changes in cell type abundances
between adipose tissue depots, corrected for sex and weight status in
46 donors. Broad cell types in the left panel, and states of
highlighted cell types in the right panel. Centres represent the mean
effect size and error bars represent the 95% confidence intervals. H
Scatterplot showing compositional changes in cell type abundances
between adipose depots, corrected for sex using deconvoluted cell type
abundances from 1293 donors. Centres represent the mean effect size and
error bars represent the 95% confidence intervals. Source data are
provided as a Source Data file.
The number of transcriptional states of mature adipocytes is currently
being debated in the field with studies reporting between three and
seven subtypes^[126]25,[127]32,[128]37. Adding to this debate, a recent
integrated analysis of sxRNA-seq studies found low batch integration
for adipocytes and inconsistent enrichment of marker genes across
studies^[129]38 suggesting either low adipocyte heterogeneity or
inadequate power to detect it due to either technical or analytical
challenges. In the WATLAS, we annotated four different adipocyte
populations; DCN^+ adipocytes are the most distinctive population
marked by genes normally associated with expressed in fibro-adipogenic
progenitors, such as DCN, apolipoprotein D (APOD) and Lumican (LUM)
suggesting they may represent newly differentiated adipocytes. The
second most distinct population is the PRSS23^+ adipocytes, which is
similar to the hAd3 population defined by the Rosen group^[130]25 and
express several sulfotransferases, such as sulfotransferase family 2B
member 1 (SULT2B1) and glutamate NMDA receptor components, such as
glutamate ionotropic receptor NMDA type subunit 2A (GRIN2A) (see Source
Data). CLSNT2^+ adipocytes resemble the hAd5 population defined by the
Rosen group^[131]25 and express genes associated with insulin
signalling and sensitivity, such as ectonucleotide pyrophosphatase
phosphodiesterase 1 (ENPP1). Finally, DGAT2^+ adipocytes, which
resemble hAd4 as defined by the Rosen group^[132]25 express lipolysis
and lipogenesis genes, such as monoacylglycerol o-acyltransferase 1
(MOGAT1) and diacylglycerol o-acyltransferase 2 (DGAT2). All four
populations have variable, but high expression of established adipocyte
markers compared to non-adipocyte cells (Fig. [133]5D) suggesting that
they all represent bona fide adipocytes. We investigated the similarity
of gene expression programs across populations, depots, and studies by
identifying marker genes for each population and scoring the enrichment
of all marker gene modules in each cell. We found that modules are
enriched in the population of origin across studied and depots, except
for SAT-derived PRSS23^+ adipocytes from the Lumeng group, which has
stronger enrichment for a DGAT2^+ signature (Fig. [134]5E). The
expression patterns of individual marker genes, although less clear,
support that the four populations reported here replicate between
studies and across depots (Supplementary Fig. [135]6A).
However, there are depot-specific differences in the expression level
of established marker genes as well as in the enrichment of gene
modules suggesting that the depot shapes the gene expression profile of
the adipocytes. A deeper analysis of depot-specific gene programs in
adipocyte subpopulations revealed that depot differences are shared
between subpopulations and studies (Fig. [136]5F), although expression
is sparser in DCN^+ adipocytes compared to other adipocyte
subpopulations, which we attribute to the low number of DCN^+
adipocytes. Taken together, this indicates that the niche effects from
the depot affect all adipocyte populations similarly. However, we found
that different depots are characterised by different white adipose
tissue composition. VAT is enriched for mesothelial cells, macrophages
overall, and especially LPL^+ macrophages (that are similar to TREM2^+
macrophages), lymphatic endothelial cells (LECs), and endometrium,
whereas SAT is enriched for vascular endothelial cells (VECs), DPP4^+
and CXCL14^+ FAPs and several immune cell types (Fig. [137]5G, see
Source Data). To support these claims, we deconvoluted 562 VAT and 731
SAT bulk RNA-seq samples from GTEx at medium label resolution using
WATLAS and evaluated depot differences in adipose tissue composition,
excluding mesothelial cells, which are only present in VAT. The average
deconvoluted cell type fractions in both VAT and SAT was highly
correlated with the observed cell type fractions observed in the WATLAS
(Supplementary Fig. [138]6B) and several trends from the single-cell
dataset are conserved including enrichment of VEC, FAPs, and T cells in
SAT and enrichment of macrophages and LECs in VAT (Fig. [139]5H). These
observations are consistent with reports that find VAT has a higher
content of macrophages in mice^[140]39 and humans^[141]40,[142]41,
while SAT has been reported to have a higher content of adipocyte
progenitors in humans^[143]40 and a higher proportion of CD4^+T
cells^[144]42. In the literature, there are conflicting reports
concerning vascularisation, capillary density, and endothelial cells.
Some studies find an increase in VAT compared to SAT^[145]43–[146]45,
while other studies find no difference 45 or the opposite^[147]46. This
suggests that the difference in vascularity between depots depends on
the biological context.
Discussion
To deeply characterise the transcriptional state of cells in the human
body using single-cell or single-nucleus RNA-sequencing (sxRNA-seq),
and evaluate how diverse biological processes, such as development and
disease, alter those states, it is necessary to distinguish between
biological and technical variation. Here, we introduce JOINTLY, a
hybrid linear and non-linear matrix factorisation-based method for
joint clustering of sxRNA-seq. JOINTLY aims to capture shared features
across datasets without explicitly harmonising them. To that end,
JOINTLY defines a reduced dimensional space, using consensus PCA, which
is based on shared axes of variation supplemented with the
batch-specific axis of variation, that describes most of the variation
across all datasets. This choice is appropriate for datasets, where the
major axes of variation are assumed to be similar, such as biological
replicates. For more complex datasets, where all samples do not
necessarily share the same axes of variation, such as multi-condition,
multi-sample datasets, we advise users to use regular PCA in JOINTLY to
ensure that the most important axes of variation are captured. Cellular
similarity is estimated in this reduced space using a data-adaptive
heat-based kernel^[148]9 as well as a shared nearest neighbourhood
graph. These measures are used to learn a reduced dimensional space,
which reconstructs non-linear cell-to-cell similarity in each dataset,
as well as the linear gene expression space across datasets. This
allows for joint clustering of the data without explicitly assuming any
overlap between datasets, and it allows for the interpretation of the
model to discover genes’ contribution to clustering. We evaluate
JOINTLY and compare its performance to eight state-of-the-art batch
integration methods^[149]1–[150]3,[151]7,[152]12–[153]14 on 52
sxRNA-seq samples from five tissues. The chosen methods are widely
applied in the field, and are, like JOINTLY, unsupervised integration
methods. We chose to only include unsupervised methods, as systematic
benchmarking has shown that (semi)supervised integration methods often
outperform their unsupervised counterparts^[154]4. JOINTLY performs on
par with state-of-the-art batch integration tools, such as scVI^[155]3
and Harmony^[156]2, in clustering tasks and has a similar trade-off
between biological heterogeneity and batch mixing as scVI. In line with
a recent benchmark^[157]47, we found that JOINTLY and several
task-specific models, outperformed scGPT, a foundational single-cell
RNA-sequencing model. As a future perspective, we envision that the
performance of JOINTLY can be even further improved by initialising the
algorithm using cell type labels.
A critical and challenging task for joint clustering is
multi-condition, multi-sample datasets, where both batch effects, which
should be removed, and biological variation introduced by the
condition, which should be retained, contribute to intra-sample
variance. To assess how the different methods perform in this setting,
we generated an artificial multi-condition, multi-sample dataset by
mixing single-cell RNA-sequencing samples from the lung and the
pancreas. We found that JOINTLY, scVI, and Scanorama retain biological
variation across conditions, while removing batch effects, suggesting
that these methods are preferable for analysing multi-condition,
multi-sample datasets, such as cohort studies of healthy and diseased
individuals.
In our benchmark, we noted that although JOINTLY, scVI^[158]3, and
Harmony^[159]2 generally have the best performance, there is
substantial task-to-task variation. Across tasks, JOINTLY achieves
ranks between second and fifth, while both scVI and Harmony rank
between first and seventh. This indicates that no method is always the
best across all tasks, and therefore, that it is critical to evaluate
the performance of a chosen method on any new integration task. This
can be accomplished using various integration metrics, such as those
formalised in scIB^[160]4. However, such metrics do not provide insight
into the genes or features driving integration and clustering. By
design, JOINTLY is interpretable meaning that it learns which gene
modules contribute to integration and clustering. Analysis of these
gene modules gives the analyst an immediate insight into the dataset,
allowing the analyst to compare genes driving clustering to known
marker genes, thereby evaluating the biology of integration. Based on a
cell line dataset with simulated batch effects, we found that the
interpretable factors generated by JOINTLY are highly insensitive to
batch effects. To further evaluate the interpretable factors, we
processed a dataset from human visceral white adipose tissue consisting
of nine samples. We found that the interpretable factors can be used to
predict cell type labels and to re-discover a known sexual dimorphism
in insulin signalling in adipocytes and fibro-adipogenic progenitors.
Compared to regular cluster-based differential expression analysis, the
genes identified through interpretation of JOINTLY are more
discriminative, and more conserved between batches. An additional
advantage of the interpretable factors is that they not limited to
being enriched in a cluster or cell type but can be shared between
several cell types that share transcriptional programs, or specific to
certain conditions. This allows the analyst to discover active
biological programs that would have been missed by cluster-based DE
analysis. Taken together, this indicates that in addition to helping
the analyst evaluate clustering, the interpretable factors can also be
used to guide the functional annotation of a dataset and to discover
important active biological processes.
Finally, we use JOINTLY to create a white adipose tissue reference
atlas (WATLAS) by processing and labelling six independent datasets
with JOINTLY and subsequently, integrating across datasets using
scANVI, a semi-supervised integration method. This yields an expandable
atlas available as a resource for the community to explore^[161]34 and
use for contextualisation of new datasets with transfer
learning^[162]35. In the WATLAS, we identify 46 distinct cell states,
with unique and specific gene signatures. Our WATLAS covers a high
dynamic range of cell abundances, from highly abundant adipocytes to
ultra-rare populations, such as Schwann cells, which have not
previously been identified in individual scRNA-seq datasets, as well as
a plethora of subpopulations of major cell types. Unlike previous
studies, we find that adipocyte populations do share gene expression
programs. We characterise compositional changes between lean and obese
donors and between subcutaneous and visceral adipose tissue finding
remodelling of the immune environment and changes in the balance
between adipocytes and progenitor cells. We support the latter analysis
by performing computational deconvolution of 1293 bulk RNA-sequencing
samples from human subcutaneous and visceral adipose tissue depots.
These analyses highlight that the WATLAS can be used for deconvolution
with high accuracy and be used to discover compositional differences in
obesity, type 2 diabetes, or other adipose tissue-related diseases.
In summary, JOINTLY is an advanced computational tool for
characterising transcriptional states in sxRNA-seq data. By effectively
addressing biological and technical variations, our method enables
joint clustering and interpretation, facilitating the exploration of
diverse biological processes across multiple datasets. To facilitate
the use of JOINTLY, we have developed an R
([163]www.github.com/madsen-lab/rJOINTLY) package, that integrates with
common single-cell analysis frameworks, such as Seurat^[164]13.
Methods
JOINTLY
The default workflow for the integration of datasets using JOINTLY
involved six major steps:
1. Highly variable feature selection.
2. Normalisation and dimensional reduction.
3. Cell-cell similarity estimation using a kernel.
4. Optimising factors using hybrid linear and non-linear non-negative
matrix factorisation.
5. Identification of gene modules.
6. Clustering and visualisation.
As a default, JOINTLY automatically selects (default = 1000) highly
variable genes (HVGs) using a deviance-based measure^[165]48,[166]49
for each batch, and the union across all HVG sets is used for
downstream analyses. Users can use Seurat for HVG identification or
supply a user-defined list of HVGs.
In each batch, a size factor for each cell is calculated by dividing
the sum of counts by 10,000 similarly to Seurat^[167]13 or optionally
using scran^[168]50. Each cell is then normalised and transformed using
the logCPM method from Seurat^[169]13 or optionally the shifted log
transform^[170]51. The normalised and transformed counts are then
standardised and decomposed into a reduced dimensional space using the
selected HVGs. By default, JOINTLY uses consensus PCA for dimensional
reduction. In consensus PCA, we calculate the variance-covariance
matrix for each batch and calculate their sum weighted by the number of
cells in each batch. This within-group variance-covariance matrix is
decomposed using randomised singular value decomposition^[171]52 into
20 components. In each batch, the amount of variance explained by these
15 components relative to a dataset-specific decomposition using the
same number of components is evaluated. For batches where the common
decomposition explains less than 80% of the variance explained by the
dataset-specific decomposition, additional batch-specific components
are added (see Supplementary Note [172]1 for a mathematical definition
of consensus PCA).
The cell-cell similarity in each dataset is estimated using a
data-adaptive alpha-decay kernel^[173]9 based on the Euclidean distance
between cells in the consensus PCA (or user-supplied) reduced
dimensional space. The reduced dimensional space is also used to
calculate a shared nearest neighbour graph using Seurat^[174]13.
To optimise factors using hybrid linear and non-linear non-negative
matrix factorisation, JOINTLY minimised the following loss function:
[MATH:
argminF,
H=αΦXd−ΦXdFdHd+λTrHdLd
HdT
msubsup>+β∑d=1D∑<
/mo>j≠dVjHd−VdHds.t.Fd
mrow>≥0,Hd
mrow>≥0,Hd
mrow>HdT=I,d=1,2,…,D :MATH]
1
This approach aims to reconstruct the normalised and standardised count
matrix as well as the kernel space using the H (clustering matrix), F
(basis matrix of the kernel), and V (shared feature matrix) matrices.
The first term in the loss function, weighted by
[MATH: α :MATH]
, represents the reconstruction error within a single batch. Minimising
this term encourages the clustering matrix to reconstruct the kernel.
The second term, weighted by
[MATH: λ :MATH]
, applies graph regularisation to the H matrix. Minimising this term
encourages that cells close to each other in the consensus PCA space
also are close to each other in the clustering matrix. The last term,
weighted by
[MATH: β :MATH]
, represents the sum of differences in linear gene expression space
across the clustering matrix between batches. Minimising this term
encourages that the same genes contribute the same factors in the
clustering matrix across batches. By default, the H matrix is
initialised per batch using fuzzy clustering in the consensus PCA
space, and the F and V matrices using linear regression based on the H
matrix as well as the kernel and gene expression space, respectively.
Next, the loss function is minimised using multiplicative updating
algorithm^[175]53 (see Supplementary Note [176]2 for a mathematical
description of the algorithm) for 200 iterations. Finally, the H
matrices from each batch are concatenated and standardised per factor
and then per cell.
To identify and score gene modules, the V matrices are standardised per
factor and then per gene for each batch and then averaged across
batches. The averaged V matrix is standardised per factor and then per
gene. Genes are assigned to each factor by ordering the gene scores
within each factor and finding the inflection point of the score
distribution using the unit invariant knee method^[177]54. Genes with
scores higher than the inflection point are assigned to the factor
module and each factor module is scored in each cell using
UCell^[178]55.
For clustering and visualisation, the standardised H matrix is used as
input. For visualisation, UMAP coordinates were calculated using
Seurat^[179]13 and for clustering, we used hierarchical graph-based
clustering^[180]56, but any other clustering methods can be applied,
such as Louvain clustering.
Benchmarking
Datasets
Cell lines^[181]11: Counts were downloaded from Zenodo under
[10.5281/zenodo.3238275] and randomly split each cell type into two
batches of equal size. We simulated a complex batch effect by
introducing cell type-specific Poisson noise into the second batch. For
each cell type, we drew a maximum noise level at random from a normal
distribution with a mean and standard deviation of 0.5. Then we draw a
noise fraction for each gene from a uniform distribution between zero
and the maximum noise level for the cell type. Based on the average
gene count across cells in the cell type and batch and the noise
fraction, we calculate the mean of the noise distribution and randomly
draw noise counts from the Poisson distribution. Adding gene-wise noise
to gene-wise counts results in a new gene-wise Poisson distribution
with a shifted mean corresponding to complex cell type- and
gene-specific batch effects.
Lung^[182]4: Counts were downloaded from figshare under
[10.6084/m9.figshare.12420968.v8]. The dataset was subset to only
contain cells from control patients retaining 10,046 cells and 15,148
genes across six batches. The donor ID was used as a batch identifier,
and cell type labels were transferred using a complementary human lung
atlas^[183]15.
Pancreas^[184]19: Counts from 20,784 cells and 30,378 genes across 12
donors were downloaded from NCBI Gene Expression Omnibus under
accession code [185]GSE114297. The donor ID was used as a batch
identifier, and cell type labels were transferred using a complementary
human pancreas atlas^[186]24.
Kidney^[187]20: Counts were downloaded from cellxgene under accession
code [188]bcb61471-2a44-4d00-a0af-ff085512674c. The dataset was subset
to only contain cells from healthy living donors retaining 20,497 cells
and 27,305 genes across 18 batches. The donor ID was used as a batch
identifier, and cell type labels were transferred using the single
nucleus fraction from the same kidney atlas^[189]20.
Liver^[190]18: Counts from 8444 cells and 32,922 genes across five
donors were downloaded from cellxgene under accession code
[191]bd5230f4-cd76-4d35-9ee5-89b3e7475659. The donor ID was used as a
batch identifier, and cell type labels were transferred using a
complementary human liver atlas^[192]22.
PBMC^[193]21: Counts were downloaded from cellxgene under accession
code [194]03f821b4-87be-4ff4-b65a-b5fc00061da7. The dataset was subset
to only contain cells from healthy adult donors retaining 46,993 cells
and 33,193 genes across eleven batches. The donor ID was used as a
batch identifier, and cell type labels were transferred using
complementary human PBMC atlas^[195]23.
Label transfer
For all datasets, except the cell line dataset, we transferred labels
from similar public datasets as indicated above. Query and reference
datasets were normalised to 10,000 UMI counts per cell and transformed
using log(x+1). In the reference, differentially expressed (DE) genes
were identified in each cell type versus all other cells with
Scanpy^[196]57 using Wilcoxon rank sum with log fold changes > 3. Gene
sets are pruned, to remove genes that are in the first percentile
highest DE in any other cell type. The top 50 genes are kept per cell
type. Genes lists are further pruned for genes not expressed in the
query dataset. We then performed two rounds of label transfer. In the
first round, we trained a one-class support vector machine per cell
type in the reference dataset using sklearn (linear kernel and nu =
0.5). Since probabilities in SVMs are uncalibrated, we identified
classification optimal thresholds per cell type by maximising the
geometric mean between the sensitivity and the specificity on the
reference data and used these to label cells in the query dataset.
Cells labelled as multiple cell types were labelled as unknown, as were
cells not labelled as any cell type. In the second round of labelling,
we used the labelled query cells to train a linear support vector
classifier using sklearn (maximum of 1000 iterations with balanced
class weights). Again, we identified classification optimal thresholds
per cell type by maximising the geometric mean between the sensitivity
and the specificity and obtained the final labels for all cells in the
query dataset, and unlabelled or ambiguously labelled cells were
removed from the dataset.
Integration parameters
JOINTLY was run with default parameters on all datasets, except for the
cell line dataset, which was run using 15 components.
scVI^[197]3 parameters were obtained from online tutorials:^[198]58 The
dataset was subset to highly variable genes selected by JOINTLY and
integrated based on raw counts running 500 epochs using early stopping
with patience of 10. The integration was evaluated and visualised using
10 dimensions.
Harmony^[199]2 parameters were obtained from online tutorials:^[200]59
The dataset was normalised, scaled, and dimensionally reduced using
Seurat based on highly variable genes selected by JOINTLY. Harmony was
run using default parameters and evaluated and visualised using 20
dimensions.
Scanorama^[201]12 parameters were obtained from online
tutorials:^[202]60 The dataset was normalised using scanpy^[203]57 and
subset to highly variable genes selected by JOINTLY. Each batch was
scaled independently and integrated. The integration was evaluated and
visualised using 50 dimensions.
LIGER^[204]7 parameters were obtained from online tutorials:^[205]61
The dataset was normalised, scaled (per batch, without centring), and
dimensionally reduced using Seurat based on highly variable genes
selected by JOINTLY. The datasets were integrated using 20 factors and
a lambda of five following my quantile normalisation. The integration
was evaluated and visualised using 20 components.
fastMNN^[206]1 parameters were obtained from online tutorials:^[207]62
The datasets were integrated using auto merge and highly variable genes
selected by JOINTLY. The integration was evaluated and visualised using
20 dimensions.
Seurat^[208]13 parameters were obtained from online tutorials:^[209]63
The dataset was normalised, scaled, and dimensionally reduced using
Seurat based on highly variable genes selected by JOINTLY. Integration
anchors were identified using reciprocal PCA and highly variable genes
selected by JOINTLY. The datasets were integrated using default
parameters, except for datasets with small batches (less than 100
cells) where a k.weight of 50 was used. The integration was evaluated
using 30 dimensions.
scGPT^[210]14 parameters were obtained from online tutorials:^[211]64
The dataset was normalised and scaled using Scanpy based on highly
variable genes selected by JOINTLY. The datasets were integrated using
the parameters supplied in the tutorial, except that cell type labels
were not used for train test splitting making the method unsupervised.
The integration was evaluated using 512 dimensions.
Integration evaluation
We evaluated how well clustering could recover cell types (using only
cell types with at least 10 cells) by clustering each dataset based on
the integrated reduced dimensional space using hierarchical graph-based
clustering^[212]56 into between 2 and 50 clusters. For each clustering
solution, we calculated the adjusted rand index (ARI) between the
clusters and the transferred labels for the entire dataset and each
batch. A higher value indicates better agreement between clusters and
cell type labels.
We evaluated how well cell types and batches mix using the local
inverse Simpson’s Index (LISI)^[213]2, which is a metric that measures
the distributions of categorical variables over local neighbourhoods.
We calculated LISI for cell type labels and batch labels and rescaled
the metrics as previously described^[214]4 such that a value of 0
corresponds to low cell type separation and low batch integration,
whereas a value of 1 corresponds to high cell type separation and high
batch integration,
Finally, we evaluated the distance between cell types and batches using
the average silhouette width (ASW). The silhouette width measures the
relationship between within-group distances and between-group
distances. We calculated scores for cell types and batches as
previously described^[215]4, where scores are transformed such that
values of 0 correspond to worst performance (low cell type separation
and high batch separation), whereas values of 1 correspond to best
performance (high cell type separation and low batch separation).
To account for the stochasticity of some of the methods, we ran all
methods five times and reported across all metrics for the integration
run with the highest overall ARI. Ranks for each metric and dataset was
calculated using two significant digits and setting the minimum rank
for ties. The overall rank across tissues was calculated by averaging
across the ranks of individual metrics and setting the minimum rank for
ties.
Neighbourhood conservation
To evaluate neighbourhood conservation between two datasets, we
calculated the 100 nearest neighbours for each cell in each dataset
based on the dataset-specific reduced dimensional space from JOINTLY,
and for cells represented in both datasets, we summed the number of
shared neighbours. This total of shared neighbours is divided by the
total number of possible shared neighbours.
Over-correction analysis
Dataset, integration, and evaluation
Datasets were merged pairwise to generate all possible combinations of
the 5 datasets used for benchmarking. To merge datasets, the datasets
were subset to genes detected in both datasets and combined.
The combined datasets were integrated as described under ‘Benchmarking’
with the exception that JOINTLY was run using PCA rather than consensus
PCA. To evaluate integration, we calculated an over-correction score by
calculating the integration LISI using the tissue as the label. Next,
we split each dataset into each tissue and calculated the integration
and cell type LISI within each tissue. Ranks for each metric were
calculated using three significant digits and setting the minimum rank
for ties. The overall rank was calculated by averaging across the
overall rank within each tissue separately and the over-correction rank
and setting the minimum rank for ties.
Differential expression analysis
To identify differentially expressed genes, we created pseudo-bulk
expression levels per donor for each cell type in each tissue. Next, we
prefiltered genes that did not have at least 10 counts across all
samples We performed pairwise differential expression between all cell
types. To identify shared endothelial cell marker genes, we extracted
all pairwise tests between endothelial cells from the Lung or the
Pancreas dataset and any non-endothelial cell type. We combined the
p-values from all tests using the harmonic mean and FDR-corrected the
combined p-value. Finally, shared marker genes were defined as genes
with a combined FDR ≤ 0.01 and log2 fold change ≥ 1.5 in all pairwise
tests. Endothelial state markers were defined as genes with an
FDR ≤ 0.01 and absolute log2 fold change ≥ 1.5 between Lung- and
Pancreas-derived endothelial cells.
Pathway and transcription factor identification
Enrichment analysis of gene modules was performed using enrichR^[216]65
with default parameters using the Reactome_2022 databases for pathway
identification and ChEA_2022 for transcription factor identification.
P-values were corrected using FDR, and terms with an FDR-corrected
P-value less than 0.05 were kept.
Interpretability analysis
Datasets
Emont^[217]25: Counts were downloaded from the Single Cell Portal under
accession code [218]SCP1376. The dataset was subset to total visceral
adipose tissue samples, containing 80,0085 cells across 10 donors.
Mouse gastrulation^[219]28: Counts were downloaded using scVelo^[220]27
and contain 9815 cells across three sequencing batches.
Cell type labelling
Enrichment analysis of gene modules was performed using enrichR^[221]65
with default parameters using the Azimuth_Cell_Types_2021 and
CellMarker_Augmented_2021 databases. In the Azimuth_Cell_Types_2021,
which does not contain categories specifically from adipose tissue, all
terms associated with Neurons were removed, as well as terms without
cell type or tissue names. In the CellMarker_Augment_2021, which does
contain categories specifically associated with adipose tissue, we only
retained categories from adipose tissue, excluding brown and beige
adipose tissue. All terms with a false discovery rate corrected P-value
less than 0.1 were kept and for each tested factor, the top three terms
ordered by the combined score^[222]65 were reported.
Pathway identification
Enrichment analysis was performed using enrichR^[223]65 with default
parameters using the WikiPathway_2021_Human, KEGG_2021_Human, and
Reactome_2022 databases for the human adipose tissue and the
WikiPathways_2019_Mouse database for the mouse gastrulation dataset.
P-values were corrected using FDR, terms with an FDR-corrected P-value
less than 0.05 were kept and selected pathways were shown.
Constructing a white adipose tissue atlas (WATLAS)
Datasets
Tabula Sapiens^[224]15: Counts were download from Figshare under
10.6084/m9.figshare.14267219^[225]66 and cells from the adipose tissue
from two donors were extracted. Subsequently, cells with an abnormal
relationship between the number of detected genes and the total counts
were removed in two rounds by fitting a linear regression model to the
two variables and removing cells with an absolute residual above 2 in
the first round and above 0.8 in the second round. Finally, cells with
a fraction of counts derived from mitochondrial genes above 0.2 were
removed.
Jaitin^[226]31: Counts from one donor were downloaded from the NCBI
Gene Expression Omnibus under accession code [227]GSE128518 and
metadata from Bitbucket^[228]67 repository
amitlab/adipose-tissue-immune-cells-2019
[[229]https://bitbucket.org/amitlab/adipose-tissue-immune-cells-2019/sr
c/master/]. Cells passing quality control by the original authors (as
indicated in the metadata) were kept and further filtered to remove
cells with an abnormal relationship between the number of detected
genes and the total counts were removed by fitting a linear regression
model to the two variables and removing cells with an absolute residual
above 0.8 for run 22 and 0.7 for run 55. Finally, cells with a fraction
of counts derived from mitochondrial genes above 0.15 were removed.
Vijay^[230]30: Counts were downloaded from the NCBI Gene Expression
Omnibus under accession code [231]GSE129363. Cells with a
log2-transformed total count and a total number of features higher than
six were kept. Next, cells with an abnormal relationship between the
number of detected genes and the total counts were removed by fitting a
linear regression model to the two variables and removing cells with an
absolute residual above 1.0. Finally, cells with a fraction of counts
derived from mitochondrial genes above 0.2 were removed, and the two
diabetic donors were removed.
Hildreth^[232]29: Counts from six donors were downloaded from the NCBI
Gene Expression Omnibus under accession code [233]GSE155960. All
datasets were merged and cells with an abnormal relationship between
the number of detected genes and the total counts were removed by
fitting a linear regression model to the two variables and removing
cells with an absolute residual above 0.8. Finally, cells less than 500
counts or a fraction of counts derived from mitochondrial genes above
0.15 were removed.
Emont:^[234]25 Counts were downloaded from the Single Cell Portal under
accession code [235]SCP1376. The dataset was split into two independent
sets containing samples from the stromal vascular fraction and whole
adipose tissue, respectively.
Barboza:^[236]32 We obtained a dataset from whole adipose tissue from
the original authors.
Integration
After processing and filtering, each dataset was processed with JOINTLY
using 15 components, clustered using hierarchical graph-based
clustering^[237]56 into coarse cell types, and labelled based on marker
genes. Clusters, that could not be assigned to specific cell types, had
high mitochondrial content, or low specificity of marker genes were
removed. We concatenated the datasets, harmonising metadata, and gene
names, retaining only genes with a total count above 10 and expressed
in at least one cell. We integrated the datasets using scANVI^[238]33
using 15 latent dimensions and parameters from scArches^[239]68. The
cells were clustered and separated into fine cell types and states, as
well as removing a few clusters containing low-quality cells (8097
cells removed, 2.6%). The final dataset was re-integrated using scANVI
and the fine-grained labels.
Composition analysis
Cell type or state fractions were calculated for each donor and
modelled using linear regression correcting for variables as indicated
in the figures. Marginalised effect sizes and confidence intervals were
calculated using emmeans.
Cell type decomposition
GTEx adipose bulk RNA-seq data were deconvoluted using bisque^[240]69.
Pseudo-bulk references were generated for both VAT and SAT depots of