Graphical abstract
graphic file with name fx1.jpg
[33]Open in a new tab
Highlights
* •
Integrated LIANA and Tensor-cell2cell for cell-cell communication
analysis across samples
* •
Enables flexible selection of methods and resources for cell-cell
communication inference
* •
Provides step-by-step analysis, accompanied by online tutorials in
Python and R
* •
Demonstrates broad applicability to single-cell data in diverse
biological conditions
Motivation
Multiple cell-cell communication (CCC) tools exist, yet results are
specific to the tool of choice due to the diverse assumptions made
across computational frameworks. Moreover, tools are often limited to
analyzing single samples or performing pairwise comparisons. As
experimental design complexity and sample numbers continue to increase
in single-cell datasets, so does the need for versatile methods to
decipher cell-cell communication in such scenarios. By integrating
LIANA and Tensor-cell2cell, we present a protocol that enables the use
of a diverse array of tools and resources to assess interpretable CCC
programs across multiple samples.
__________________________________________________________________
By integrating LIANA and Tensor-cell2cell, Baghdassarian et al. provide
a unified protocol for unsupervised analysis of cell-cell communication
(CCC) across multi-sample single-cell datasets. By facilitating CCC
method selection and identification of context-driven communication
programs, this approach garners insights into diverse biological
processes. This work is accompanied by user-friendly tutorials for both
Python and R.
Introduction
Cell-cell communication (CCC) coordinates higher-order biological
functions in multicellular organisms,[34]^1^,[35]^2 dictating
phenotypes in response to different contexts such as disease state,
spatial location, and organismal life stage. In recent years, many
tools have been developed to leverage single-cell and spatial
transcriptomics data to study CCC events driving various biological
processes.[36]^2^,[37]^3^,[38]^4 While each computational strategy
contributes unique and valuable developments, many are tool specific
and challenging to integrate due to the large number of different
inference methods and resources housing prior
knowledge.[39]^2^,[40]^5^,[41]^6^,[42]^7 Moreover, most tools do not
account for the relationships of coordinated CCC events (CCC programs)
across different contexts,[43]^8 either disregarding context altogether
by analyzing samples individually or being limited to pairwise
comparisons. Thus, as the ability to generate large single-cell and
spatial transcriptomics datasets and the interest in studying CCC
programs continue to increase,[44]^9^,[45]^10^,[46]^11 the need to
robustly decipher CCC is becoming essential.
Comparison with other methods
A plethora of ligand-receptor (LR) methods have emerged, most of which
were published with their own resources.[47]^1^,[48]^5^,[49]^12 Many of
these provide distinct scoring functions to prioritize interactions,
yet studies have reported low agreement between their
predictions.[50]^5^,[51]^13^,[52]^14 Due to the lack of a gold
standard, the benchmark of these methods remains limited,[53]^2^,[54]^5
and it is challenging to choose the method that works best. To this
end, in addition to providing multiple individual methods via
ligand-receptor analysis framework (LIANA), we also enable their
consensus, which we use in this protocol, under the assumption that the
wisdom of the crowd is less biased than any individual method.[55]^15
While many methods exist to infer ligand-receptor interactions from a
single sample, fewer approaches were designed to compare CCC
interactions across conditions. These include CrossTalkeR,[56]^16 which
utilizes network topological measures to compare communication
patterns, CellPhoneDB,[57]^17 which accepts user-provided lists of
differentially expressed genes to return relevant ligand-receptor
interactions, and scDiffCom,[58]^18 which uses a combined permutation
approach across both cell types and conditions. Still, the
aforementioned approaches are limited to pairwise comparisons. Other
approaches can directly compare CCC across more than two conditions;
however, their analysis often relies on pairwise[59]^19 or
targeted[60]^20 comparisons to integrate multiple samples. A key
feature of Tensor-cell2cell is that it considers all samples
simultaneously while preserving the relationships between
ligand-receptor interactions and communicating cell-type pairs. Thus,
Tensor-cell2cell preserves higher-order CCC relationships and
translates those into mechanistic CCC programs of potentially
interacting ligands, receptors, and communicating cell types.
Development of the protocol
We combine two independent yet highly complementary tools that leverage
existing methods to enable robust and hypothesis-free analysis of
context-driven CCC programs ([61]Figure 1). LIANA[62]^5 is a
computational framework that implements multiple available
ligand-receptor resources (i.e., database of ligand-receptor
interactions) and methods to analyze CCC. In particular, the user can
employ LIANA to select any method and resource of choice or combine
multiple approaches simultaneously to obtain consensus predictions.
Tensor-cell2cell[63]^12 is a dimensionality reduction approach devised
to uncover context-driven CCC programs across multiple samples
simultaneously. Specifically, Tensor-cell2cell uses CCC scores inferred
by any method and arranges the data into a four-dimensional (4D) tensor
to capture the coordinated relationship between ligand-receptor
interactions, communicating cell-type pairs, and samples. Together,
LIANA and Tensor-cell2cell unify existing approaches to enable
researchers to easily use their preferred CCC resource and method and
subsequently analyze any number of samples into biologically relevant
CCC insights without the additional complications of installing
separate tools or reconciling discrepancies between them.
Figure 1.
[64]Figure 1
[65]Open in a new tab
Integration of LIANA and Tensor-cell2cell to identify context-driven
programs of cell-cell communication
LIANA and Tensor-cell2cell can be used together to infer the molecular
basis of cell-cell interactions by running analysis across multiple
samples, conditions, or contexts. Given a method, resource, and
expression data, LIANA outputs CCC scores for all interactions in a
sample. We adapted both tools to be highly compatible with each other,
so LIANA outputs can be directly passed to Tensor-cell2cell to detect
the programs from the scores computed with LIANA. Tensor-cell2cell uses
the communication scores generated for multiple samples to identify
context-driven CCC programs.
For this protocol, we adapted LIANA and Tensor-cell2cell to enable
their smooth integration. Thus, our protocol demonstrates the concerted
use of both tools, describes the insights they provide, and facilitates
the interpretation of their outputs. We base this protocol on recent
best practices for single-cell transcriptomics and CCC
inference.[66]^21 We begin by processing the key inputs of our tools.
Then, we guide the selection of methods and prior-knowledge resources
to score intercellular communication using LIANA’s consensus method and
resource to infer the potential CCC events for each sample. We use
Tensor-cell2cell to summarize the intercellular communication events
across samples, and we describe key technical considerations to enable
consistent decomposition results. Finally, we guide the interpretation
of the decomposition results and show multiple downstream analyses and
visualizations to facilitate interpretation of the context-dependent
CCC programs. For example, we illustrate how biologically relevant
results can be obtained by coupling the outputs with pathway enrichment
analyses. We also provide quick-start and in-depth online tutorials
with detailed descriptions of all steps described in this protocol and
their crucial parameters. All these materials are available in both
Python and R at [67]https://ccc-protocols.readthedocs.io/. While here
we showcase an analysis on coronavirus disease 2019 (COVID-19) data,
online tutorials also show applications on transcriptomics data of
lupus peripheral blood mononuclear cells and spatial transcriptomics
data of myocardial infarction, further demonstrating the adaptability
of our combined tools. Collectively, these materials provide a
comprehensive and flexible playbook to investigate CCC from single-cell
transcriptomics.
Applications of the protocol
LIANA and Tensor-cell2cell have been used for diverse purposes. LIANA
was initially used to compare and evaluate different ligand-receptor
methods in diverse biological contexts. Tensor-cell2cell was originally
applied to link CCC programs with different severities of COVID-19 and
autism spectrum disorder (ASD).[68]^12 Briefly, LIANA evaluated
different methods and showed that they have limited agreement in terms
of communication mechanisms,[69]^5^,[70]^12 while Tensor-cell2cell
revealed distinct CCC program dysregulations associated with severe
COVID-19 specifically rather than moderate cases, as well as
combinations of programs distinguishing ASD from neurotypical samples.
Notably, LIANA provides a consensus resource and can aggregate multiple
methods into consensus communication scores. Additionally, there is a
natural complementarity between the two tools, as Tensor-cell2cell can
use input scores from any CCC method ([71]Figure 1) and generates
consistent decomposition results across methods. Thus, our tools are
highly generalizable and applicable to the analysis of any single-cell
transcriptomics datasets. For example, LIANA has been used for the
analysis of myocardial infarction[72]^22 and transforming growth factor
β signaling in breast cancer,[73]^23 among others. Our tools are also
applicable to other data modalities containing potentially interacting
cell populations. Specifically, one can adapt LIANA or use existing
spatial tools[74]^24 and combine their outputs with Tensor-cell2cell to
generate spatially informed CCC insights across contexts. Similarly,
one can also obtain metabolite-mediated intercellular
interactions[75]^25^,[76]^26 and decompose those into patterns across
contexts with Tensor-cell2cell.[77]^27 One can also apply
Tensor-cell2cell to extract CCC programs occurring at specific
tissues[78]^28 or at a whole-body organism level.[79]^28^,[80]^29 In
this protocol, we focus on how one can leverage the different CCC
methods and resources, generalized by LIANA, to infer context-dependent
CCC programs with Tensor-cell2cell from single-cell transcriptomics
data.
Results
In this section, we introduce our protocol ([81]Figure 2) using Python.
The same protocol is implemented in R and is available online at
[82]https://ccc-protocols.readthedocs.io/en/latest/notebooks/ccc_R/Quic
kStart.html.
Figure 2.
[83]Figure 2
[84]Open in a new tab
Overview of the protocol for inferring cell-cell communication through
LIANA and Tensor-cell2cell
Main inputs, steps, resources, and options are summarized for the
distinct steps of this protocol.
(A) A preprocessed gene expression matrix according to the best
practices of single-cell analysis is expected as input ([85]data
preprocessing in the [86]results section).
(B) The input data are integrated with the ligand-receptor resources
available in LIANA to infer cell-cell communication using any of the
methods implemented in LIANA ([87]inferring cell-cell communication in
the [88]results section). An output containing the cell-cell
communication scores across all interactions per sample is generated.
(C) The LIANA output is then directly passed to Tensor-cell2cell to
build the respective communication tensor used by the tensor component
analysis ([89]building a 4D-communication tensor and [90]running
Tensor-cell2cell across samples in the [91]results section). The output
generated by Tensor-cell2cell can be later employed for other
downstream analyses ([92]downstream visualizations: making sense of the
factors and [93]pathway enrichment analysis: interpreting the
context-driven communication in the [94]results section).
Step 1: Installation and environment setup
Install Anaconda or Miniconda through the official instructions at
[95]https://docs.anaconda.com/anaconda/install/index.html.
Then, open a terminal to create and activate a conda environment.
conda create -n ccc_protocols
conda activate ccc_protocols
If you will be using a graphics processing unit (GPU), then install
PyTorch using conda.
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c
pytorch -c nvidia
Install Tensor-cell2cell, LIANA, and decoupler using PyPI.
pip install cell2cell liana decoupler
For fully reproducible runs of our tutorials in both Python and R, we
have specified the required packages and their versions in the software
requirements table ([96]STAR Methods). You can also follow instructions
in the [97]environment setup section to install a clean virtual
environment with all package requirements.
Notebooks to run this tutorial can be created by starting a Jupyter
Notebook.
jupyter notebook
Step 2: Initial setups
First, if you are using an NVIDIA GPU with Compute Unified Device
Architecture (CUDA) cores, then set “use_gpu = True” and enable PyTorch
with the following code block. Otherwise, set “use_gpu = False” or skip
this part.
use_gpu = True
if use_gpu:
import tensorly as tl
tl.set_backend('pytorch')
Then, import all the packages we will use in this tutorial.
import cell2cell as c2c
import liana as li
import pandas as pd
import decoupler as dc
import scanpy as sc
import matplotlib.pyplot as plt
%matplotlib inline
import plotnine as p9
import seaborn as sns
Afterward, specify the data and output directories.
data_folder = '../../data/quickstart/'
output_folder = '../../data/quickstart/outputs/'
c2c.io.directories.create_directory(data_folder)
c2c.io.directories.create_directory(output_folder)
We begin by loading the single-cell transcriptomics data. For this
tutorial, we will use a lung dataset of 63,000 immune and epithelial
cells across three control, three moderate, and six severe COVID-19
patients (Zenodo Data:
[98]https://doi.org/10.5281/zenodo.7706962).[99]^30 We use a convenient
function to download the data and store it in the AnnData format, on
which the scanpy[100]^31 package is built.
adata = c2c.datasets.balf_covid(data_folder +
'/Liao-BALF-COVID-19.h5ad')
Step 3: Data preprocessing
Data preprocessing is crucial for the correct application of this
([101]Figure 2A). Here, we only highlight the essential steps. However,
other aspects of data preprocessing should be considered and performed
according to the best practices of single-cell analysis
([102]https://github.com/theislab/single-cell-best-practices).
Quality control (timing: <5 min)
The loaded data have already been preprocessed to a degree and come
with cell annotations. Nevertheless, we highlight some of the key
steps. To mitigate noise, we filter non-informative cells and genes.
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
We additionally remove a high mitochondrial content.
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata,
qc_vars=['mt'],
percent_top=None,
log1p=False,
inplace=True)
adata = adata[adata.obs.pct_counts_mt < 15, :]
This is followed by removing cells with a high number of total unique
molecular identifier (UMI) counts, potentially representing more than
one single cell (doublets):
adata = adata[adata.obs.n_genes < 5500,:]
Caution: Here, we covered the absolute basics. We omit other common
practice steps, such as the removal of doublets and cells with high
ribosomal content and the correction of ambient RNA. Additionally, in
certain scenarios, particularly in those where technical variation is
expected to be notable, the application of quality control steps by
sample is desirable.[103]^21
Normalization (timing: <2 min)
We have now removed the majority of noisy readouts and can proceed to
count normalization, as most CCC tools typically use normalized count
matrices as input. Normalized counts are usually obtained in two
essential steps, the first being count depth scaling, which ensures
that the measured count depths are comparable across cells. This is
then usually followed up with log1p transformation, which stabilizes
the variance of the counts and enables the use of linear metrics
downstream.
# Save the raw counts to a layer
adata.layers["counts"] = adata.X.copy()
# Normalize the data
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
Critical: A key parameter of this command is as follows:
* •
“target_sum” ensures that after normalization, each observation
(cell) has a total count equal to that number.
These normalization steps ensure that the aggregation of cells into
cell types, a common practice for CCC inference, is done on comparable
cells with approximately normally distributed feature values.
Troubleshooting: Expression matrices with “not a number” (nan),
negative, or infinity (inf) values cause errors. Users should stick to
common normalization techniques, and any nan, negative, or inf values
must be filled to avoid errors.
Step 4: Inferring CCC
Following preprocessing of the single-cell transcriptomics data, we
proceed to the inference of potential CCC events ([104]Figure 2B). In
this case, we will use LIANA to infer the ligand-receptor interactions
for each sample. LIANA is available in Python and R and supports
Scanpy, SingleCellExperiment, and Seurat objects as input. LIANA is
highly modularized, and it natively implements the formulations of
several methods, including CellPhoneDBv2,[105]^32 Connectome,[106]^33
log2 fold change (log2FC), NATMI,[107]^34 SingleCellSignalR,[108]^35
CellChat,[109]^19 and a geometric mean, as well as a consensus score in
the form of a rank aggregate[110]^36 from any combination of methods
([111]Figure 3). The high modularity of LIANA further enables the
straightforward addition of any other ligand-receptor method.
Figure 3.
[112]Figure 3
[113]Open in a new tab
LIANA is a user-friendly and modular ligand-receptor analysis framework
LIANA provides a variety of methods and resources to infer cell-cell
communication, making it easy to use multiple existing methods in a
coherent manner. It also provides consensus scores and resources to
provide generalized results. Figure was adapted from Dimitrov
et al.[114]^5
LIANA classifies the scoring functions from the different methods into
two categories: those that infer the “magnitude” and “specificity” of
interactions. The magnitude of an interaction is a measure of the
strength of the interaction, and the specificity of an interaction is a
measure of how specific an interaction is to a given pair of cell
groups. Generally, these categories are complementary, and the
magnitude of the interaction is often in agreement with the specificity
of the interaction. In other words, a ligand-receptor interaction with
a high magnitude score in a given pair of cell types is likely to also
be specific, and vice versa.
Selecting a method to infer CCC
While there are many commonalities between the different methods
implemented in LIANA, there also are many variations and different
assumptions affecting how the magnitude and specificity scores are
calculated (see [115]STAR Methods). These variations can result in
limited agreement in inferred predictions when using different CCC
methods.[116]^5^,[117]^13^,[118]^14 To this end, in LIANA, we
additionally provide a “rank_aggregate” score, which can be used to
aggregate any of the scoring functions above into a consensus score.
By default, LIANA calculates an aggregate rank using a
re-implementation of the RobustRankAggregate method[119]^36 and
generates a probability distribution for ligand-receptors that are
ranked consistently better than expected under a null hypothesis (see
[120]STAR Methods). The consensus of ligand-receptor interactions
across methods can therefore be treated as a p value. We show in detail
how LIANA’s rank aggregate or any of the individual methods can be used
to infer communication events from a single sample or context at
“Python Tutorial 02 Infer-Communication-Scores”
([121]https://ccc-protocols.readthedocs.io/en/latest/notebooks/ccc_pyth
on/02-Infer-Communication-Scores.html).
Critical: When using LIANA with Tensor-cell2cell, we recommend
selecting a scoring function that reflects the magnitude of the
interactions, as how the interactions’ specificity relates to changes
across samples is unclear. In this protocol, we will use the
“magnitude_rank” scoring function from LIANA, under the assumption that
ensemble approaches are potentially less biased than any single method
alone.[122]^15
We further show that Tensor captures consistent CCC programs when using
different methods and add a tutorial to explore method consistency on
any dataset:
[123]https://ccc-protocols.readthedocs.io/en/latest/notebooks/ccc_pytho
n/S3B_Score_Consistency.html.
Troubleshooting: The default decomposition method of Tensor-cell2cell
is a non-negative tensor component analysis, which, as implied, expects
non-negative values as the inputs. Thus, when selecting the method of
choice, make sure that you do not have negative CCC scores. If so, you
can replace them by zeros or the minimum positive value.
Selecting ligand-receptor resources
When considering ligand-receptor prior-knowledge resources, a common
theme is the trade-off between coverage and quality, and similarly,
each resource comes with its own biases.[124]^5 LIANA takes advantage
of OmniPath,[125]^37 which includes expert-curated resources of
CellPhoneDBv2,[126]^32 CellChat,[127]^19 ICELLNET,[128]^38
connectomeDB2020,[129]^34 and CellTalkDB,[130]^39 as well as 10
others.[131]^5^,[132]^37 LIANA further provides a consensus
expert-curated resource from the aforementioned five resources, along
with some curated interactions from SignaLink.[133]^40 In this
protocol, we will use the consensus resource from LIANA, though any of
the other resources are available via LIANA, and one can also use LIANA
with their own custom resource.
Selecting any of the lists of ligand-receptor pairs in LIANA can be
done through the following command.
lr_pairs = li.resource.select_resource('consensus')
Here, “consensus” indicates the use of LIANA’s consensus resource, but
it can be replaced by any other available resource (e.g.,
“cellphonedb,” “cellchatdb,” “connectomeDB,” etc.).
Note that any of the resources available in LIANA can be used by
passing them as a string to “resource_name.” All of LIANA’s resources
can be listed with “li.resource.show_resources().” Users can also
provide custom resources as a pandas DataFrame to run in LIANA so long
as they are formatted the same as other resources (i.e., include two
columns named ligand and receptor, containing the respective partners
in the ligand-receptor interactions). Hence, users may pass a dataframe
containing a personalized list of interactions to liana using the
“resource” parameter in the next “rank_aggregate” function below.
Troubleshooting: Users should choose a resource with gene identifiers
and an organism that corresponds to that of their data. By default,
LIANA uses human gene symbol identifiers but additionally provides a
murine resource as well as functionalities to convert via orthology to
other organisms.
Running LIANA for each sample (timing: 4 min)
Here, we will run LIANA’s “rank_aggregate” with six methods (by
default, CellPhoneDBv2, CellChat, SingleCellSignalR, NATMI, Connectome,
and log2FC) on all of the samples in the dataset.
li.mt.rank_aggregate.by_sample(adata,
sample_key='sample_new',
groupby='celltype', resource_name='consensus',
expr_prop=0.1,
min_cells=5,
n_perms=100,
use_raw=False,
verbose=True,
inplace=True
)
Critical: Key parameters here are as follows:
* •
“adata” stands for AnnData, the data format used by scanpy.[134]^31
* •
“sample_key” corresponds to the sample identifiers, available as a
column in the “adata.obs” dataframe.
* •
“groupby” corresponds to the cell group label stored in
“adata.obs.”
* •
“resource_name” is the name of any of the resources available via
LIANA.
* •
“expr_prop” is the expression proportion threshold (in terms of
cells per cell type expressing the protein) for any protein subunit
involved in the interaction, according to which we keep or discard
the interactions.
* •
“min_cells” is the minimum number of cells per cell type required
for a cell type to be considered in the analysis.
* •
“n_perms” is the number of permutations for p value estimation.
* •
“use_raw” is a Boolean that indicates whether to use the
“adata.raw” slot; here, the log-normalized counts are assigned to
“adata.X,” and other options include passing the name of a layer
via the “layer” parameter or using the counts stored in
“adata.raw.”
Critical: LIANA considers interactions as occurring only if the ligand
and receptor, and all of their subunits, are expressed in at least 10%
of the cells (by default) in both clusters involved in the interaction.
Any interactions that do not pass these criteria are not returned by
default. To return those, the user can use the “return_all_lrs”
parameter. These results will later be used to generate a tensor of
ligand-receptor interactions across contexts that will be decomposed
into CCC programs by Tensor-Cell2cell. Thus, how non-expressed
interactions are handled is critical to consider when building the
tensor later on (see “Python Tutorial 03 Generate-Tensor”
([135]https://ccc-protocols.readthedocs.io/en/latest/notebooks/ccc_pyth
on/03-Generate-Tensor.html).
One can visualize the output as a dot plot while including every sample
in the dataset.
li.pl.dotplot_by_sample(adata=adata,
colour='magnitude_rank',
size='specificity_rank',
source_labels=["B", "pDC", "Epithelial"],
target_labels=["Macrophages", "Mast", "pDC", "NK"],
ligand_complex=['VIM', 'SCGB3A1'],
receptor_complex=['CD44', 'MARCO'],
sample_key='sample_new',
inverse_colour=True,
inverse_size=True,
figure_size=(14, 10),
size_range=(1, 6),
)
Key parameters here are as follows:
* •
“source_labels” is a list containing the names of the sender cells
of interest.
* •
“target_labels” is a list containing the names of the receiver
cells of interest.
* •
“ligand_complex” is a list containing the names of the ligands of
interest.
* •
“receptor_complex” is a list containing the names of the receptors
of interest.
* •
“sample_key” is a string containing the column name where samples
are specified.
This command leads to the generation of [136]Figure 4.
Figure 4.
[137]Figure 4
[138]Open in a new tab
Dot plot of cell-cell communication between immune cells per sample
Here, sender and receiver cells are represented as source and target (y
and x axes, respectively). Each major column groups cells by sample,
while each major row groups cells by the ligand-receptor interaction
they are using. Dot size represents the specificity (ranks) assigned by
LIANA, while the color represents the magnitude (ranks) of the
interaction.
Pause point: We can export the LIANA results by sample to a CSV and
save them for later use.
adata.uns['liana_res'].to_csv(output_folder + '/LIANA_by_sample.csv',
index=False)
Alternatively, one could just export the whole AnnData object, together
with the ligand-receptor results stored at “adata.uns[‘liana_res’].”
adata.write_h5ad(output_folder + '/adata_processed.h5ad',
compression='gzip')
Step 5: Comparing CCC across multiple samples
Building a 4D-communication tensor (timing: <1 min)
First, we generate a list containing all samples from our AnnData
object. For visualization purposes, we sort them according to COVID-19
severity. Here, we can find the names of each of the samples in the
“sample_new” column of the adata.obs information.
sorted_samples = sorted(adata.obs['sample_new'].unique())
Tensor-cell2cell performs a tensor decomposition to find
context-dependent patterns of CCC. It builds a 4D-communication tensor,
which, in this case, is built from the communication scores obtained
from LIANA for every combination of ligand-receptor and sender-receiver
cell pairs per sample ([139]Figures 2B and 2C). For this protocol and
associated tutorials, we implemented a function that facilitates
building this communication tensor.
tensor = li.multi.to_tensor_c2c(liana_res=adata.uns['liana_res'],
sample_key='sample_new',
source_key='source',
target_key='target',
ligand_key='ligand_complex',
receptor_key='receptor_complex',
score_key='magnitude_rank',
inverse_fun=lambda x: 1 - x,
how='outer',
outer_fraction=1/3.,
context_order=sorted_samples,
)
Troubleshooting: Since the “magnitude_rank” from LIANA represents a
score where the values closest to 0 represent the most probable
communication events, we need to invert the communication scores to use
it with Tensor-cell2cell. See the parameter “inverse_fun” below for
further details for transforming this score.
Critical: Key parameters here are as follows:
* •
“liana_res” is the dataframe containing the results from LIANA,
usually located in “adata.uns[‘liana_res’].” We can pass directly
the AnnData object to the parameter adata to this function. If the
AnnData object is passed, then we do not need to specify the
liana_res parameter.
* •
“sample_key,” “source_key,” “target_key,” “ligand_key,”
“receptor_key,” and “score_key” are the column names in the
dataframe containing the samples, sender cells, receiver cells,
ligands, receptors, and communication scores, respectively. Each
row of the dataframe contains a unique combination of these
elements.
* •
“inverse_fun” is the function we use to convert the communication
score before building the tensor. In this case, the
“magnitude_rank” score generated by LIANA considers low values as
the most important ones, ranging from 0 to 1. In contrast,
Tensor-cell2cell requires higher values to be the most important
scores, so here we pass a function (lambda x: 1 − x) to adapt
LIANA’s magnitude-rank scores (subtracts LIANA’s score from 1). If
“None” is passed instead, then no transformation will be performed
on the communication score. If using other scores coming from one
of the methods implemented in LIANA, then a similar transformation
can be done depending on the parameters and assumptions of the
scoring method.
* •
“how” controls “which” ligand-receptor pairs and cell types to
include when building the tensor. This decision depends on whether
the missing values across a number of samples for both
ligand-receptor interactions and sender-receiver cell pairs are
considered to be biologically relevant. Options are as follows:
+ o
“inner” is the most strict option since it only considers cell
types and ligand-receptor pairs that are present in all
contexts (intersection).
+ o
“outer” considers all cell types and ligand-receptor pairs
that are present across contexts (union).
+ o
“outer_lrs” considers only cell types that are present in all
contexts (intersection) but all ligand-receptor pairs that are
present across contexts (union).
+ o
“outer_cells” considers only ligand-receptor pairs that are
present in all contexts (intersection) but all cell types that
are present across contexts (union).
* •
“outer_fraction” controls the elements to include in the union
scenario of the how options. Only elements that are present at
least in this fraction of samples/contexts will be included. When
this value is 0, the tensor includes all elements across the
samples. When this value is 1, it acts as using how = “inner.”
* •
“context_order” is a list specifying the order of the samples. The
order of samples does not affect the results, but it is useful for
posterior visualizations.
We can check the shape of this tensor to verify the number of samples,
ligand-receptor pairs, sender cells, and receiver cells, respectively:
tensor.shape
In addition, optionally, we can generate the metadata for coloring the
elements in each of the tensor dimensions (i.e., for each of the
contexts/samples, ligand-receptor pairs, sender cells, and receiver
cells) in posterior visualizations. These metadata correspond to
dictionaries for each of the dimensions containing the elements and
their respective major groups, such as a signaling categories for a
ligand-receptor interactions, a hierarchically more granular cell type,
or a disease condition for a sample. In cases where we do not account
for such information, we do not need to generate such dictionaries.
For example, we can build a dictionary for the contexts/samples
dictionary by using the metadata in the AnnData object. In this example
dataset, we can find samples in the column “sample_new,” while their
major groups (representing COVID-19 severity) are found in the column
“condition.”
context_dict = adata.obs.sort_values(by='sample_new') \
.set_index('sample_new')['condition'] \
.to_dict()
Then, the metadata can be generated with:
dimension_dicts = [context_dict, None, None, None] meta_tensor =
c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor,
metadata_dicts=dimension_dicts, fill_with_order_elements=True)
Notice that the “None” elements in the variable dimensions_dicts
represent the dimensions where we are not including additional
metadata. If you want to include metadata about major groups for those
dimensions, then you have to replace the corresponding “None” with a
dictionary as described before.
Pause point: We can export our tensor and its metadata for performing
the tensor decomposition later:
c2c.io.export_variable_with_pickle(variable=tensor,
filename=output_folder + '/Tensor.pkl')
c2c.io.export_variable_with_pickle(variable=meta_tensor,
filename=output_folder + '/Tensor-Metadata.pkl')
tensor = c2c.io.read_data.load_tensor(output_folder + '/Tensor.pkl')
meta_tensor = c2c.io.load_variable_with_pickle(output_folder +
'/Tensor-Metadata.pkl')
Then, we can load them with:
tensor = c2c.io.read_data.load_tensor(output_folder + '/Tensor.pkl')
meta_tensor = c2c.io.load_variable_with_pickle(output_folder +
'/Tensor-Metadata.pkl')
Running Tensor-cell2cell across samples (timing: 5 min with a “regular” run
or 40 min with a “robust” run, using a GPU in both cases)
Now that we have built the tensor and its metadata, we can run tensor
component analysis via Tensor-cell2cell with one simple command that we
implemented for our unified tools.
c2c.analysis.run_tensor_cell2cell_pipeline(interaction_tensor=tensor,
tensor_metadata=meta_tensor,
rank=None,
tf_optimization='robust',
random_state=0,
device='cuda',
output_folder=output_folder,
)
Critical: Key parameters of this command are as follows:
* •
“rank” is the number of factors or latent patterns we want to
obtain from the analysis. You can either indicate a specific number
or leave it as “None” to perform the decomposition with a suggested
number from an elbow analysis ([140]Figure 5A).
* •
“tf_optimization” indicates whether running the analysis in the
regular or the robust way. It essentially controls the convergence
parameters of the tensor decomposition. The former employs less
strict convergence parameters to obtain optimal results than the
latter, which is also translated into a faster generation of
results.
* •
“random_state” is the seed for randomization. It controls the
randomization used when initializing the optimization algorithm
that performs the tensor decomposition. It is useful for
reproducing the same result every time that the analysis is run. If
“None,” then a different randomization will be used each time.
* •
“device” indicates whether we are using the “cpu” or a GPU with
“cuda” cores. See the [141]installation section of this tutorial
for instructions to enable the use of GPU(s).
Figure 5.
[142]Figure 5
[143]Open in a new tab
Cell-cell communication programs obtained by combining LIANA and
Tensor-cell2cell
(A) Elbow analysis to select an optimal number of factors (rank of the
tensor). The red dot represents the number selected by
Tensor-cell2cell.
(B) After inferring cell-cell communication with LIANA from the
COVID-19 data and running a tensor component analysis with
Tensor-cell2cell, 10 factors were obtained (rows here), each of which
represents a different cell-cell communication program. Within a
factor, loadings (y axis) are reported for each element (x axis) in
every tensor dimension (columns). Elements here are colored by their
major groups as indicated in the legend.
This command will output three main results: a figure with the elbow
analysis for suggesting a number of factors for the decomposition (only
if “rank = None”), a figure with the loadings assigned to each element
within a tensor dimension per factor obtained, and an Excel file
containing the values of these loadings. [144]Figure 5A represents the
elbow plot generated in this case, while [145]Figure 5B shows the
loadings assigned to each tensor dimension per factor.
Troubleshooting: Elbow analysis returns a rank equal to one, or the
curve is increasing instead of decreasing. This may be due to high
sparsity in the tensor. The sparsity can be decreased by re-building
the 4D tensor after re-running LIANA ([146]running LIANA for each
sample) with a smaller “expr_prop” (e.g., “expr_prop = 0.05”) or by
only re-building the tensor ([147]building a 4D-communication tensor)
with a higher “outer_fraction” (e.g., “outer_fraction = 0.8”).
Downstream visualizations: Making sense of the factors (timing: <2 min)
The figure representing the loadings in each factor generated in the
previous section can be interpreted by interconnecting all dimensions
within a single factor ([148]Figure 5B). For each factor, we obtain
four vectors that represent the sample, ligand-receptor interaction,
sender cell type, and receiver cell-type loadings. These loadings are
the relative importance of each element in each dimension of the
original tensor. All these vectors together define a CCC program in
which high loadings represent key cells and mediators. By focusing on
sample loadings associated with a given condition label, we can thus
identify the cell types and interactions also associated with that
label. Relevant interactors can be interpreted according to their
loadings (i.e., ligand-receptor pairs, sender cells, and receiver cells
with high loadings play a more prominent role in an identified CCC
program). Ligands in high-loading ligand-receptor pairs are sent
predominantly by high-loading sender cells and interact with the
cognate receptors on the high-loadings receiver cells. In this regard,
the program would be predominantly driven by changes in the receptor
expression of receiver cells such as macrophages, neutrophils, and
myeloid dendritic cells.
We can access the loading values of samples in each of the factors with
tensor.factors['Contexts']
In this case, we obtain a dataframe where rows represent the samples,
columns the factors generated by the decomposition, and entries the
loadings of each element within the corresponding factor. We can also
access the loadings for the elements in the other dimensions by
replacing “Contexts” with “Ligand-Receptor Pairs,” “Sender Cells,” or
“Receiver Cells.” Then, we can use these loadings to perform various
downstream analyses.
For example, we can use loadings to compare groups of samples
([149]Figure 6) with boxplots and statistical tests.
groups_order = ['Control', 'Moderate COVID-19', 'Severe COVID-19']
fig_filename = output_folder + '/BALF-Severity-Boxplots.pdf'
_ =
c2c.plotting.context_boxplot(context_loadings=tensor.factors['Contexts'
],
metadict=context_dict,
nrows=3,
figsize=(16, 12),
group_order=groups_order,
statistical_test='t-test_ind',
pval_correction='fdr_bh',
cmap='plasma',
verbose=False,
filename=fig_filename
)
Figure 6.
[150]Figure 6
[151]Open in a new tab
Identifying patterns and differences across groups of conditions
Context or sample loadings can be used to compare statistically
different condition groups within the same cell-cell communication
program. Here, COVID-19 patients are grouped by severity, and pairwise
t tests are performed. Here, ∗ and ∗∗ indicate p values lower than 0.05
and 0.01, respectively, while ns means not-significant (or p value
greater than 0.05). The case of “ns” indicates that the significance is
lost after multiple test correction (false discovery rate, in this
case).
Critical: In this case, we can change the statistical test and the
multiple-test correction with the parameters “statistical_test” and
“pval_correction.” Here, we used an independent t test and a
Benjamini-Hochberg correction. Additionally, we can set “verbose =
True” to print exact test statistics and p values.
We can also generate heatmaps for the elements with loadings above a
certain threshold in a given dimension ([152]Figure S1). Furthermore,
we can cluster these elements by the similarity of their loadings
across all factors.
fig_filename = output_folder + '/Clustermap-LRs.pdf'
_ =
c2c.plotting.loading_clustermap(loadings=tensor.factors['Ligand-Recepto
r
Pairs'],
loading_threshold=0.1,
use_zscore=False,
figsize=(28, 8),
filename=fig_filename,
row_cluster=False
)
Troubleshooting: Note that here, we plot the loadings of the dimension
representing the ligand-receptor pairs. In addition, we prioritize the
pairs with high loadings using the parameter “loading_threshold = 0.1.”
In this case, the elements are included only if they are greater than
or equal to that threshold in at least one of the factors. If we use
“loading_threshold = 0,” then we would consider all of the elements.
Considering all of the elements would require modifying the parameter
“figsize” to enlarge the figure.
Caution: Changing the parameter “use_zscore” to “True” would
standardize the loadings of one element across all factors. This is
useful to compare an element across factors and highlight the factors
in which that element is most important. Modifying “row_cluster” to
“True” would also cluster the factors depending on the elements that
are important in each of them.
Furthermore, factor-specific networks of cell-cell interactions
([153]Figure S2) can be visualized by using the loadings of sender and
receiver cells.
threshold = 0.075
c2c.plotting.ccc_networks_plot(tensor.factors,
included_factors=['Factor 3', 'Factor 5', 'Factor 10'],
ccc_threshold=threshold, # Only important communication
nrows=1,
panel_size=(16, 16), # This changes the size of each figure panel.
filename=output_folder + 'Factor-Networks.pdf',
)
Critical: Key parameters of this command are as follows:
* •
“included_factors” is a list of factors to plot. If “None” is
passed, then all factor-specific networks are shown.
* •
“ccc_threshold” is a loading value to set as threshold to select
key cell-cell interactions. This threshold filters the outer
products between sender and receiver cells, and it can be either
arbitrary or determined as shown in the online tutorials.
Step 6: Pathway enrichment analysis: Interpreting the context-driven
communication
The decomposition of ligand-receptor interactions across samples into
loadings associated with the conditions reduces the dimensionality of
the inferred interactions substantially. Nevertheless, we are still
working with 1,054 interactions across multiple factors associated with
the disease labels. To this end, as is commonly done when working with
omics data types, we can perform pathway enrichment analysis to
identify the general biological processes of interest. By using the
loadings for each ligand-receptor pair ([154]Figure 5B), we can rank
them within each factor and use this ranking as input to enrichment
analysis. Pathway enrichment thus serves two purposes: it further
reduces the dimensionality of the inferred interactions and it enhances
the biological interpretability of the inferred interactions.
Here, we show the application of classical gene set enrichment analysis
(GSEA) on the ligand-receptor loadings. We use GSEA[155]^41 with KEGG
Pathways,[156]^42 as well as a multivariate linear regression from
decoupler-py[157]^43 with the PROGENy pathway resource.[158]^44
First, we assign ligand-receptor loadings to a variable.
lr_loadings = tensor.factors['Ligand-Receptor Pairs']
Classic pathway enrichment (timing: <1 min)
For the pathway enrichment analysis, we use ligand-receptor pairs
instead of individual genes. KEGG was initially designed to work with
sets of genes, so first we need to generate ligand-receptor sets for
each of its pathways. A ligand-receptor pair is assigned as part of a
pathway set if all of the genes in the pair are part of the gene set of
such a pathway.
Note that we use the “lr_pairs” database that we loaded in the
[159]selecting ligand-receptor resources section.
# Generate list with ligand-receptors pairs in DB
lr_list = ['^'.join(row) for idx, row in lr_pairs.iterrows()]
# Specify the organism and pathway database to use for building the LR
set
organism = "human"
pathwaydb = "KEGG"
# Generate ligand-receptor gene sets
lr_set = c2c.external.generate_lr_geneset(lr_list,
complex_sep='_',
lr_sep='^',
organism=organism,
pathwaydb=pathwaydb,
readable_name=True,
output_folder=output_folder
)
Critical: Key parameters of this command are as follows:
* •
“complex_sep” indicates the symbol separating the gene names in the
protein complex.
* •
“lr_sep” is the symbol separating a ligand and a receptor complex.
* •
“organism” is the organism matching the gene names in the
single-cell dataset. It could be either “human” or “mouse.”
* •
“pathwaydb” is the name of the database to be loaded, provided with
the cell2cell package. Options are “GOBP,” “KEGG,” and “Reactome.”
Run GSEA via cell2cell, which calls the “gseapy.prerank” function
internally.
pvals, scores, gsea_df = c2c.external.run_gsea(loadings=lr_loadings,
lr_set=lr_set,
output_folder=output_folder,
weight=1,
min_size=15,
permutations=999,
processes=6,
random_state=6,
significance_threshold=0.05,
)
Critical: Key parameters of this command are as follows:
* •
“lr_set” is a dictionary associating pathways (keys) with sets of
ligand-receptor pairs (values).
* •
“weight” represents the original parameter p in GSEA. It is an
exponent that controls the importance of the ranking values
(loadings, in our case).
* •
“min_size” indicates the minimum number of LR pairs that a set has
to contain to be considered in the analysis.
* •
“permutations” indicates the number of permutations to perform to
generate the null distribution.
* •
“random_state” is the reproducibility seed.
* •
“significance_threshold” is the p value threshold to consider
significance.
Now that we have obtained the normalized enrichment scores (NESs) and
corresponding p values from GSEA, we can plot those using the following
function from cell2cell ([160]Figure 7).
pathway_label = '{} Annotations'.format(pathwaydb)
fig_filename = output_folder + '/GSEA-Dotplot.pdf'
with sns.axes_style("darkgrid"):
dotplot = c2c.plotting.pval_plot.generate_dot_plot(pval_df=pvals,
score_df=scores,
significance=0.05,
xlabel='',
ylabel=pathway_label,
cbar_title='NES',
cmap='PuOr',
figsize=(5, 12),
label_size=20,
title_size=20,
tick_size=12,
filename=fig_filename
)
Figure 7.
[161]Figure 7
[162]Open in a new tab
Assigning functions to factors from GSEA
By using the loadings of ligand-receptor pairs per factor, they can be
ranked within a factor (factor-specific analysis), and this information
can be used to run an enrichment analysis such as GSEA to associate
each of the programs with different functions or pathways. This dot
plot shows the enriched KEGG pathways per factor. Dot size indicates
the –log(p value), while the color indicates the normalized enrichment
score (NES) from the GSEA.
Footprint enrichment analysis (timing: <1 min)
In footprint enrichment analysis, instead of considering the genes
whose products (proteins) are directly involved in a process of
interest, we consider the genes affected by it—i.e., those that change
downstream as a consequence of the process.[163]^45 In this case, we
will use the PROGENy resource to infer the pathways driving the
identified context-dependent patterns of ligand-receptor pairs. PROGENy
was built in a data-driven manner using perturbation data.[164]^44
Consequently, it assigns different weights to each gene in its pathway
gene sets according to its importance. Thus, we need an enrichment
method that can account for weights. To do so, we will use a
multivariate linear regression implemented in decoupler-py.[165]^43
As we did in GSEA using Tensor-cell2cell, we first have to generate
ligand-receptor gene sets while also assigning a weight to each
ligand-receptor interaction. This is done by taking the mean between
the ligand and receptor weights. For ligand and receptor complexes, we
first take the mean weight for all subunits. We keep ligand-receptor
weights only if all the proteins in the interaction are sign coherent
and present for a given pathway. Load the PROGENy gene sets and then
convert them to sets of weighted ligand-receptor pairs.
# We first load the PROGENy gene sets
net = dc.get_progeny(organism='human', top=5000)
# Then convert them to sets with weighted ligand-receptor pairs
lr_progeny = li.rs.generate_lr_geneset(lr_pairs, net, lr_sep="^")
Run footprint enrichment analysis using the “mlm” method from
decoupler-py:
="^")
estimate, pvals = dc.run_mlm(lr_loadings.transpose(),
lr_progeny,
source="source",
target="interaction",
use_raw=False
Here, “estimate” and “pvals” correspond to the t values and p values
assigned to each pathway.
Finally, we generate a heatmap for the 14 pathways in PROGENy across
all factors ([166]Figure S3A).
fig_filename = output_folder + '/PROGENy.pdf'
_ = sns.clustermap(estimate,
xticklabels=estimate.columns,
cmap='coolwarm',
z_score=4)
plt.savefig(fig_filename, dpi=300, bbox_inches='tight')
From the heatmap, we can also generate a bar plot for the PROGENy
pathways for a specific factor ([167]Figure S3B).
selected_factor = 'Factor 5'
fig_filename = output_folder +
'/PROGENy-{}.pdf'.format(selected_factor.replace(' ', '-'))
dc.plot_barplot(estimate,
selected_factor,
vertical=True,
cmap='coolwarm',
save=fig_filename)
Discussion
In this protocol, we illustrate how LIANA and Tensor-cell2cell can be
used together to provide robust and flexible solutions to infer CCC
programs across contexts. In addition to established methods for
studying ligand-receptor interactions[168]^19^,[169]^32 that LIANA also
includes, approaches geared toward the systematic inference of CCC
programs across diverse conditions are less common. A few of them, such
as CellChat,[170]^19 summarize pathway-focused similarities across
conditions based on pairwise comparisons, while MultiNicheNet[171]^20
depends on differential expression analysis and requires a hypothesis
to be defined a priori. MultiNicheNet was recently proposed to
systematically identify deregulated CCC interactions along with
associated intracellular signaling. MultiNicheNet uses a flexible
statistical framework and is capable of handling complex experimental
designs. However, MultiNicheNet depends on differential expression
analysis and hence requires a predefined hypothesis. As such, we see
MultiNicheNet and Tensor-cell2cell as complementary, since the latter
can identify patterns across all cell types and conditions in an
untargeted manner. An analogous strategy to Tensor-cell2cell can be
adopted by using factor analysis[172]^11 in LIANA to identify patterns
directly from the CCC scores.[173]^46 Hence, Tensor-cell2cell and LIANA
can help researchers to generate a specific hypothesis and identify
cell types to later use MultiNicheNet as a downstream analysis to
additionally infer intracellular signaling triggered by key ligands.
Since our pipeline is intended as a generalizable approach for use with
many different resources and methods, we additionally assessed the
robustness of our results across different inputs. Specifically, we
showed how communication scores may be different for individual samples
across methods (see Tutorial 02 in the online tutorials), whereas those
differences may be mitigated by using the consensus score or when
running Tensor-cell2cell across multiple samples (see Python
Supplementary Tutorials S3A-2 and S3B in the online tutorials).
Moreover, we provide an in-depth assessment of Tensor-cell2cell’s
sensitivity to missing values and batch effects ([174]STAR Methods).
Additional benchmarks can be found in the original
Tensor-cell2cell[175]^12 and recent LIANA+[176]^46 articles, where we
have shown that Tensor-cell2cell consistently captures CCC events
deregulated across diverse contexts and conditions. Finally, we
demonstrate the broad applicability of our protocol by also providing
an example of defining contexts to analyze CCC using spatial
transcriptomics (see [177]STAR Methods and Python Supplementary
Tutorial S4 in the online tutorials). Although the example using
spatial transcriptomics presented in our extended tutorials is a
simplified application of the concept, it could be extended to compare
multiple samples if users are able to align tissues from different
donors. Similarly, our protocol can also aid users in applications
beyond single-cell transcriptomics data, including extracting
metabolite-mediated CCC programs[178]^27 or similar extensions to
multiomics data.[179]^46
Limitations of the study
Similar to any other approach to infer CCC from transcriptomics data,
our protocol also inherits assumptions leading to certain limitations.
These include the assumption that gene co-expression is indicative of
active signaling events, which are largely mediated by proteins and
their interactions, while also disregarding multiple biological
processes, such as protein translation, post-translational
modifications, secretion, diffusion, and trigger of intracellular
events, that precede and follow the interaction itself.[180]^2^,[181]^5
Moreover, the aggregation of single cells into cell groups is essential
when inferring potential CCC events, which could occlude some signals
in heterogeneous tissues,[182]^2^,[183]^3 thereby biasing the insights
that can be obtained. Furthermore, the input of Tensor-cell2cell is a
4D tensor, so it requires that all elements be measured across all
features and samples (i.e., cell types and genes expressing ligands and
receptors). Consequently, one should consider how to handle missing
values across samples that do not capture the same cell types and/or
expressed genes. Deciding whether those reflect biologically meaningful
zeros or a technical artifact may lead to variations in the resulting
CCC programs. We provide an extended explanation of the related
parameter choices that may help users decide how to handle this
challenge ([184]STAR Methods).
STAR★Methods
Key resources table
REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data
__________________________________________________________________
COVID BALF single-cell RNA-seq dataset Liao et al.[185]^30 GEO:
[186]GSE145926; Zenodo Data:
[187]https://doi.org/10.5281/zenodo.7706962
PBMC single-cell RNA-seq dataset Kang et al.[188]^47 GEO:
[189]GSE96583; Zenodo Data:
[190]https://doi.org/10.5281/zenodo.10069528
Myocardial Infarction spatial transcriptomics dataset Kuppe et al.,
2022[191]^22 Zenodo Data: [192]https://doi.org/10.5281/zenodo.6578047
__________________________________________________________________
Software and algorithms
__________________________________________________________________
Protocol source code This paper
[193]https://doi.org/10.5281/zenodo.10700956
Code for benchmarking batch effects and missing values This paper
[194]https://doi.org/10.5281/zenodo.10713331
[195]Open in a new tab
Resource availability
Lead contact
Further information and requests for resources should be directed to
and will be fulfilled by the lead contact, Nathan E. Lewis
(nlewisres@ucsd.edu).
Materials availability
This study did not generate new unique reagents.
Data and code availability
* •
This paper analyzes existing, publicly available data. These
accession numbers for the datasets are listed in the [196]key
resources table. In particular, the BALF single-cell RNA-seq
dataset is available at [197]https://zenodo.org/record/7706962, the
PBMC single-cell RNA-seq dataset is available at
[198]https://zenodo.org/records/10069528, and the Myocardial
Infarction spatial transcriptomics dataset is available at
[199]https://zenodo.org/record/6578047.
* •
All original code has been deposited at Zenodo and is publicly
available as of the date of publication. DOIs are listed in the
[200]key resources table. Additionally, source code is available at
[201]https://github.com/saezlab/ccc_protocols and can be viewed at
[202]https://ccc-protocols.readthedocs.io/.
* •
Any additional information required to reanalyze the data reported
in this paper is available from the [203]lead contact upon request.
Method details
Computational Infrastructure
All code was ran on a computer with the following specifications.
* •
CPU: AMD Ryzen Threadripper 3960x (24 cores)
* •
Memory: 128GB DDR4
* •
GPU: NVIDIA RTX A6000 48GB
However, the minimal requirements for running this protocol are.
* •
CPU: 64-bit Intel or AMD processor (4 cores)
* •
Memory: 16GB DDR3
* •
GPU: NVIDIA GTX 1050 Ti (Optional)
* •
Storage: At least 10GB available
Timing
Expected timing for this protocol using the dataset in the [204]key
resources table:
Step 1. Installation of Anaconda/Miniconda and Python packages:
5–30 min.
Step 2. Initial setups: ∼1 min.
Step 3. Data preprocessing: 5–7 min.
Step 4. Inferring cell-cell communication with LIANA: ∼5 min.
Step 5. Comparing cell-cell communication across multiple samples with
Tensor-cell2cell: Running selection of number of factors via elbow
analysis and the tensor decomposition takes 5 min with the ‘regular’
pipeline, while the ‘robust’ pipeline takes 40 min.
Step 6. Functional Enrichment Analysis of KEGG and PROGENy pathways
respectively using GSEA and linear regression take 1 min each.
Protocol details
To run our protocol presented in this manuscript and the tutorials
available online ([205]https://ccc-protocols.readthedocs.io/), software
specifications are summarized in the Software Requirements Table. To
facilitate the setup of a virtual environment containing all required
packages with their corresponding versions, we provide an executable
`setup_env.sh` script together with instructions on a Github repository
we prepared for this protocol:
[206]https://github.com/saezlab/ccc_protocols/tree/main/env_setup.
Software Requirements Table
Package Name Package Version Language Install With
jupyter conda
ipywidgets conda
pip ≥22 Python conda
scanpy ≥1.9 Python conda
∗cuda-toolkit conda
∗pytorch-cuda 11.8 conda
∗torchvision conda
∗torchaudio conda
pytorch, ∗cuda enabled conda
scvi-tools ≥0.18 Python conda
scikit-misc 0.1.4 Python conda
cell2cell 0.7.3 Python pip
liana 1.0.3 Python pip
decoupler 1.5.0 Python pip
omnipath 1.0.7 Python pip
plotnine ≥0.12.4 Python pip
seaborn 0.11.2 Python pip
statannotations 0.5.0 Python pip
matplotlib 3.7.3 Python pip
singlecellexperiment R conda
remotes ≥2 R conda
devtools ≥2 R conda
seuratobject R conda
biocmanager ≥1.30 R conda
seurat ≥4 R conda
hd5r R conda
furrr R conda
textshape R conda
forcats R conda
rstatix R conda
ggpubr R conda
scater R conda
zellkonverter R conda
liana 0.1.13 R remotes
seurat-disk 0.0.0.9020 R remotes
decoupleR 2.3.3 R biocmanager
[207]Open in a new tab
∗: For GPU enabled use only.
Python packages should always be installed. R language packages only
need to be installed if planning to run the notebooks in R.
Advice to deal with potential issues running this protocol, either in
its original or personalized forms, is summarized in the
Troubleshooting Table.
Troubleshooting Table.
Step Problem Possible reason Solution
3 & 4 Error: Expression matrix contains non-finite values (nan or inf)
Warning: Make sure that normalized counts are passed Mishandling counts
processing Ensure that the matrix containing normalized counts is
passed. Replace nan and inf values by zeros.
4.1 Negative values in LIANA outputs Using preprocessed data with
negative expression values. Avoid using preprocessing methods that
generate negative values (e.g., centering the data to the mean values,
using batch-corrected expression values, etc.).
4.2 Not enough ligand-receptor pairs in the data for the analysis
Mismatched symbol IDs LIANA by default uses a resource with gene symbol
IDs. When working with e.g., Ensembl IDs users need to provide an
external resource; see
[208]https://ccc-protocols.readthedocs.io/en/latest/notebooks/ccc_pytho
n/02-Infer-Communication-Scores.html
5.1 CCC scores representing opposed importance When using
‘magnitude_rank’ scores from LIANA, lower values are more important.
However, Tensor-cell2cell prioritizes high values as the important
ones. Build the 4D tensor using an `inverse_fun` to make lower values
to be the most important scores.
5.2 Rank selection through the elbow analysis is not behaving properly
High sparsity or number of missing values in the tensor Re-run LIANA
with less stringent parameters (e.g., smaller expr_pror). Re-build the
tensor with more strict how parameters (e.g., using how = ‘inner’ or
increasing outer_fraction).
5.3 Visualization of loadings are not properly displayed in heatmaps
Too many or few elements in the dimension to visualize To visualize all
elements, use the parameter `loading_threshold = 0′ to create the
heatmaps. If you have too many elements, you can prioritize those with
high loadings, so a threshold can be set. E.g., `loading_threshold =
0.1′
[209]Open in a new tab
Benchmarking batch effects and missing values
To help users make informed decisions regarding choices in their
computational pipeline, we benchmarked two key factors that can
influence Tensor-cell2cell′s outputs: batch effects and missing data
(which result in missing tensor indices) across samples. For
comprehensive details on the motivation, methods, and results of this
benchmarking, please see the online description.[210]^48
Here we describe our pipeline for both the Missing Indices and Batch
Effects benchmarking simulations. All associated code can be found in
the following repository:
[211]https://github.com/hmbaghdassarian/tc2c_benchmark. For downstream
analyses, unless otherwise specified, all linear regressions were
performed using a generalized linear model (GLM) with an identity link
function; multivariate regressions with >1 independent variable were
combined additively and do not include interaction terms. Additionally,
all p-values were multiple-test-corrected using the Benjamini-Hochberg
(BH) method to control for false discovery rates (FDRs).
We simulated single-cell RNA-sequencing expression data using
Splatter,[212]^49 adapting a previously described computational
approach.[213]^50 We generated a single-cell expression matrix
containing 2000 genes and 5000 cells evenly distributed across 6 cell
types and 5 samples. Each sample represents a context.
Next, for each sample, we applied quality control filters to the cells
and genes as implemented previously.[214]^50 Briefly, low-quality cells
were identified and filtered using the scuttle package based on
standard metrics (mitochondrial fraction, library size, and number of
genes detected); genes detected in fewer than 1% of cells are
discarded. Next, counts were normalized using scran pooling[215]^51 and
a log[+1] transformation. For batch-effect benchmarking, batch
correction was further implemented; Scanorama[216]^52 was run on the
log-normalized counts matrix and scVI[217]^53 was run on the raw counts
matrix.
From the expression counts matrices, a random subset of 200 genes were
chosen to simulate a ligand-receptor interaction network as previously
described.[218]^12 Briefly, we use StabEco’s[219]^54 BiGraph function,
with the power law exponent value set to 2 and the average degree value
set to 3, to generate a scale-free, directed, bipartite network of the
200 genes. Half the genes were assigned to be ligands and the other
half to be receptors. Not all genes were part of the connected network
(70/200), and these were excluded from downstream analyses. This
interaction network was used as custom ligand-receptor resource input
to LIANA’s cell-cell communication scoring.
Then, 4D-Communication tensors were built from the output of LIANA as
described in our protocol. To generate missing indices in the
4D-Communication Tensor, we iteratively omitted a random subset of
genes or cell types from the expression data. Specifically, we iterated
through combinations of the following two variables: the fraction of
cell types to remove in a given sample (16, 13, 12, and 23), the
fraction of genes (within the 130 in the simulated LR interaction
network) to remove in a given sample (110, 310, 12), and the fraction
of samples to apply these omissions to (15, 25, 23). We compared this
to a gold-standard tensor with no missing indices.
We compared decomposition outputs using CorrIndex[220]^55 as previously
described.[221]^12 Briefly, the CorrIndex represents a dissimilarity
between decomposition outputs and lies between 0 and 1; we convert this
to a similarity metric by using (1-CorrIndex).
For batch-correction, iterating across increasing levels of batch
severity, we generated four counts matrices.
* (1)
Gold-standard: a processed counts matrix with no batch effects
* (2)
Log-normalized: a processed counts matrix with batch effects
present
* (3)
Scanorama batch-corrected: a processed counts matrix with batch
effects corrected for using Scanorama
* (4)
scVI batch-corrected: a processed counts matrix with batch effects
corrected for using scVI
We ran the combined LIANA and Tensor-cell2cell pipeline on each of
these counts matrices. Finally, we assessed the similarity between each
of the decomposition outputs as follows.
* •
Log-normalized similarity: Similarity between Tensor-cell2cell′s
decomposition output from the log-normalized counts matrix (2) and
that of the gold-standard input (1)
* •
Scanorama similarity: Similarity between Tensor-cell2cell′s
decomposition output from the Scanorama batch-corrected counts
matrix (3) and that of the gold-standard input (1)
* •
scVI similarity: Similarity between Tensor-cell2cell′s
decomposition output from the scVI batch-corrected counts matrix
(4) and that of the gold-standard as input (1)
Additionally, for batch correction benchmarking, each counts matrix was
quantified for its level of batch severity using two previously applied
metrics[222]^50^,[223]^56: (1) kBET,[224]^57 is an inverse measure of
“mixability”, or the extent to which batch effects are removed, and (2)
normalized mutual information (NMI) between cell type identity and
cluster identity - a measure of “clusterability”, or the extent to
which biological variation is conserved. For the clusterability metric,
we subtracted the NMI from 1 to quantity batch severity. In this
manner, both mixability and clusterability ranged between 0 and 1, with
increasing values indicating increasing batch severity. Clusterability
was assessed using both k-means clustering[225]^58 and Louvain
clustering.[226]^59
Batch severity does not affect the results of our pipeline. We saw that
the gold-standard matrix performed as expected, showing clear Louvain
clusterability and little-to-no mixability. The log-normalized matrix
also performed as expected across all batch severity metrics. While the
batch-corrected counts matrices increased along with the Splatter
parameters on occasion, the increases were overall less severe than
that of the log-normalized matrix ([227]Figure S4). The gold-standard
counts matrices demonstrate comparably low batch severity across all
iterations ([228]Figure S5A). We also saw that across all batch
severity metrics, similarity does not decrease beyond 0.963, indicating
that Tensor-cell2cell is robust to batch effects ([229]Figure S5B).
Furthermore, we evaluated whether the fraction of negative counts is a
confounder of batch severity ([230]Figures S5C–S5E). The fraction of
negative counts does not substantially affect the Scanorama similarity
as indicated by the small regression coefficient estimate and
insignificant p-value ([231]Figure S5F). This tells us that using batch
correction methods that introduce negative values and simply replacing
those with 0 prior to running communication scoring can be appropriate
for recovering biological signals from Tensor-cell2cell.
If batch correction improves decomposition, we would expect
batch-corrected similarity (Scanorama and scVI) to a) score higher than
log-normalized similarity across batch similarity metrics and b)
decrease at a lower rate with increasing batch severity than
log-normalized similarity. Across batch severity metrics, we see that
this tends not to be the case, though all similarity types maintain a
high similarity score across batch severity levels ([232]Figure S6)
Overall, while batch effect correction may not be necessary to recover
biological signals using Tensor-cell2cell, if the user feels it is
important, they can be comfortable in implementing the batch correction
method of choice.
Regarding missing values, we found that there was a significant
decrease in the similarity of Tensor-cell2cell′s output with that of
the gold-standard as the fraction of missing indices increased when
filling both with NaN (masked) or zero (not masked). However, those
that were not masked had a substantially larger decrease in similarity
than those that were ([233]Figure S7A). When considering the two
filling methods in combination with the missing fraction, we see that
similarity is lower by 0.094 on average when filling with zero
([234]Figure S7B). Altogether, our pipeline is robust enough to impute
missing values and sensitive enough to handle true biological zeros.
Quantification and statistical analysis
Notations for the scoring functions in LIANA
[MATH: k :MATH]
is the k-th ligand-receptor interaction
[MATH: L :MATH]
- expression of ligand L
[MATH: R :MATH]
- expression of receptor R
[MATH: C :MATH]
- cell cluster
[MATH: i :MATH]
- cell group i
[MATH: j :MATH]
- cell group j
[MATH: M :MATH]
- the library-size normalized and log1p-transformed gene expression
matrix
[MATH: X :MATH]
- normalized gene expression vector
We denote the two interaction proteins, via their genes
[MATH: L :MATH]
&
[MATH: R :MATH]
, yet we use this for convenience as these can also denote the
interaction of any other event category, such as those between
membrane-bound or extracellular matrix proteins. Furthermore, in the
case of heteromeric complexes
[MATH: L :MATH]
&
[MATH: R :MATH]
denote the summarized expression of the complex.
CellPhoneDBv2[235]^32 function.
Magnitude: 1)
[MATH:
LRmeank,i,j<
/mrow>=LC
mi>i+RCj2 :MATH]
Specificity: A permutation approach also adapted by other methods, see
4)
Geometric Mean function.
Magnitude: 2)
[MATH:
LRgeometric.
meank,i,j=L<
mi>Ci·RCj :MATH]
Specificity: An adaptation of CellPhoneDB’s permutation approach; see
4)
CellChat’s[236]^19 LR probabilities∗ function.
Magnitude: 3)
[MATH:
LRprobk,i,j<
/mrow>=TriMean(Lci
)·TriMean(Rcj
)Kh+TriMean
(Lci
)·TriMean(Rcj
) :MATH]
where Kh = 0.5 by default and `TriMean` represents Tuckey’s Trimean
function:
[MATH:
TriMean
(X)=Q0.25
msub>(X)+2·Q0.5<
/mn>(X)+Q0.75(X)4
:MATH]
Specificity: An adaptation of CellPhoneDB’s permutation approach; see
4)
∗Note: The original CellChat implementation uses information of
mediator proteins (e.g. activators and inhibitors) and signaling
pathways, which is specific to the CellChat resource. Since LIANA
allows combining any resource with different scoring methods, LIANA
does not utilize this information, and hence the implementation of
CellChat's scoring function in LIANA was simplified to be
resource-agnostic.
[MATH:
p−valuek,i,<
mi>j=1P∑p=1P[fun<
/mrow>permuted(LCi∗,RC
mi>j∗)≥fun
mrow>observ
ed(LCi∗,RC
mi>j∗)] :MATH]
Equation 4
where
[MATH: P :MATH]
is the number of permutations, and
[MATH: L∗ :MATH]
and
[MATH: R∗ :MATH]
are ligand and receptor expressions summarized according to the
aggregation function per cluster used by each method, i.e., by default
the arithmetic mean for CellPhoneDB and Geometric Mean, and TriMean for
CellChat.
SingleCellSignalR[237]^35 function.
Magnitude: 5)
[MATH:
LRscorek,i,<
mi>j=L
CiRCj
LCiRC<
mi>j+μ :MATH]
where
[MATH: μ :MATH]
is the mean of the expression matrix
[MATH: M :MATH]
NATMI[238]^34 function.
Magnitude: 6)
[MATH:
LRproductk,<
mi>i,j=LCi
RCj<
/mrow> :MATH]
Specificity: 7)
[MATH:
SpecificityWeigh
mi>tk,i,j
=LCi∑nL
Ci·<
mfrac>RCi∑nRCj :MATH]
Connectome[239]^33 function.
Magnitude: 6)
[MATH:
LRproductk,<
mi>i,j=LCi
RCj<
/mrow> :MATH]
Specificity: 8)
[MATH:
LRz.meank,i<
mo>,j=zL
mi>Ci+z
RCj2 :MATH]
where
[MATH: z :MATH]
is the Z score of the expression matrix
[MATH: M :MATH]
Log2FC function.
Specificity: 9)
[MATH:
LRlog2FCK,I<
mo>,J=Log2FC
Ci,L+Log2FC
Cj,R<
mn>2 :MATH]
where log2FC for each gene is calculated as:
[MATH: log2FC=log2(mean(Xi))−log2(mean(Xno<
mi>ti)) :MATH]
Equation 10
Rank Aggregate function.
When generating a consensus from the different methods in LIANA, a rank
aggregate[240]^36 is calculated for the magnitude and specificity
scores from the methods separately. First, a normalized rank
matrix[0,1] is generated separately for magnitude and specificity as:
[MATH:
ri,j=ranki,
jmax(ranki,
j)(1
≤i≤m,1≤j≤n)
:MATH]
Equation 11
where
[MATH: m :MATH]
is the number of ranked score vectors,
[MATH: n :MATH]
is the length of each score vector (number of interactions),
[MATH: ranki,
j :MATH]
is the rank of the j-th element (interaction) in the i-th score rank
vector, and
[MATH: max(ran<
mi>ki) :MATH]
is the maximum rank in the i-th rank vector.
For each normalized rank vector
[MATH: r :MATH]
, we then ask how probable it is to obtain
[MATH:
r(k)null<=r(k
) :MATH]
, where
[MATH:
r(k)null
:MATH]
is a rank vector generated under the null hypothesis. The
RobustRankAggregate ([241]https://github.com/cran/RobustRankAggreg)
method expresses the probability
[MATH:
r(k)null<=r(k
) :MATH]
as
[MATH:
βk,n(r) :MATH]
through a beta distribution. This entails that we obtain probabilities
for each score vector
[MATH: r :MATH]
as:
[MATH: p(r)=βk
mi>,n<
mn>1,⋯,nmin(r
)∗n :MATH]
Equation 12
where we take the minimum probability
[MATH: ρ :MATH]
for each interaction across the score vectors, and we apply a
Bonferroni correction to the p-values by multiplying them by
[MATH: n :MATH]
to account for multiple testing.
For all the methods above, LIANA considers interactions as occurring
only if the ligand and receptor, and all of their subunits, are
expressed in a certain proportion of the cells (0.1 by default) in both
clusters involved in the interaction. This can be formulated as an
indicator function as follows:
[MATH: I{LCjexpr.p<
/mi>rop≥0.1andRCj
msub>expr.p
rop≥0.1} :MATH]
Acknowledgments