Abstract
The gene regulatory network inference method based on bulk sequencing
data not only confuses different types of cells, but also ignores the
phenomenon of network dynamic changes with cell state. Single cell
transcriptome sequencing technology provides data support for
constructing cell type and state specific gene regulatory networks.
This study proposes a method for inferring cell type and state specific
gene regulatory networks based on scRNA-seq data, called inferCSN.
Firstly, inferCSN infers pseudo temporal information from scRNA-seq
data and reorders cells based on this information. Because of the
uneven distribution of cells in pseudo temporal information, the
regulatory relationship tends to lean towards the high-density areas of
cells. Therefore, based on the cell state, we divide the cells into
different windows to eliminate the temporal information differences
caused by cell density. Then, a sparse regression model, combined with
reference network information, is used to construct a cell
type-specific regulatory network (CSN) for each window. The
experimental results on both simulated and real scRNA-seq datasets show
that inferCSN outperforms other methods in multiple performance
metrics. In addition, experimental results on datasets of different
types (such as steady-state and linear datasets) and scales (different
cell and gene numbers) show that inferCSN is robust. To further
demonstrate the effectiveness and application prospects of inferCSN, we
analyzed the gene regulatory network of T cells in different states and
different tumor subclons within the tumor microenvironment, and we
found that comparing the regulatory networks in different states can
reveal immune suppression related signaling pathways.
Subject terms: Computational biology and bioinformatics, Systems
biology
Introduction
Constructing and mining gene regulatory network are crucial for
explaining biological phenomena such as cell differentiation and cell
fate, and can even be used to discover the immune escape pathways
adopted by tumor cells for evading immune system surveillance. For
example, how tumor cells regulate the high expression of PD-L1 to
inhibit cytotoxic T lymphocytes (CTL) cell killing. However, the
expression level of the target gene in traditional bulk sequencing data
is the average of its expression in different types and states of
cells, which not only confuses different types of cells in gene
regulatory network (GRN) construction methods, but also ignores the
dynamic changes of the network with cell state, resulting in a large
number of false positive or false negative edges in the regulatory
network. Consequently, such defects will lead to confusion in
understanding downstream biological phenomena.
Single-cell RNA sequencing (scRNA-seq) technology distinguishes
different types of cells, and even different states of the same cell
type, with unprecedented resolution, which is crucial for understanding
dynamic and complex cellular processes^[40]1,[41]2. The use of
scRNA-seq data can construct a GRN that is specific to cell type and
even cell state. This is crucial for understanding the complex
interactions between tumor cells and immune cells during tumor
occurrence, development, and metastasis, and can guide researchers in
identifying new drug targets and developing effective immunotherapy
drugs. However, due to the high noise (low signal-to-noise ratio, SNR),
high dimensionality, and a large number of dropouts of scRNA-seq data,
constructing regulatory networks based on scRNA-seq data still faces
challenges.
So far, many regulatory network methods have been proposed, and the
datasets used for network inference can be divided into: methods based
on bulk sequencing transcriptome data^[42]3, methods based on scRNA-seq
data^[43]4, and methods based on multi-source and multi-omics data. The
method based on bulk sequencing transcriptome data, such as
GENIE3^[44]5 based on random forest model, combines the known
relationships of TFs and target genes in the curated database, and uses
the expression of TF-encoding gene to predict target gene expression to
eliminate false positive edges. The main drawback of GENIE3 is its
neglect of cellular heterogeneity and high false positive rate; Methods
based on scRNA-seq data, such as SCENIC^[45]6,[46]7 using TF and target
gene coexpression patterns to generate cell type specific GRNs, and
then further pruning the edges of the network using the prior
information. The regulatory relationship between TF and genes will
change over time. However, due to the scarcity of temporal data in
scRNA-seq, this kind of method ignores the highly dynamic nature of
regulatory networks. Fortunately, studies have shown that cell
differentiation processes can be modeled and inferred by calculating
pseudo-time information of cells, which means that the pseudo-time
information can be used to accurately construct GRN^[47]8. For example,
LEAP^[48]9 and SINCERITIES^[49]10 both use pseudo-temporal sorting to
infer the directionality between genes in GRN. The LEAP method defines
a fixed size pseudo time window and assumes that genes expressed
earlier within one time window may affect the expression of other
genes. Then, the GRN is constructed by calculating the Pearson
correlation coefficients (PCC). SINCERITIES^[50]10 uses Kolmogorov
Smirnov (KS) distance to quantify the distance between two cumulative
distribution functions of gene expression and subsequent time points,
and infers the directional regulatory relationships between genes
through ridge regression and partial correlation analysis.
BiXGBoost^[51]11 effectively utilizes temporal information and
integrates XGBoost regression models to evaluate feature importance.
Randomization and regularization are also applied in BiXGBoost to solve
overfitting problems. Normi^[52]12 employs a fixed size sliding window
method throughout the entire trajectory to address these issues, and
applies an average smoothing strategy to the gene expression of cells
in each window to obtain representative cells. In addition, Normi
determines the optimal time delay for each gene pair through distance
correlation, resulting in good robustness. However, there is still
significant room for improvement, such as not taking into account the
distribution of cell density over time^[53]13, and simply treating the
gene expression profiles of all cells as an expression matrix without
considering cell types. Especially in some cases, their performance is
not significantly better than that of random networks^[54]14; SCORPION
uses a messagepassing algorithm to construct comparable GRNs which are
suitable for population-level comparisons by leveraging the same
baseline priors^[55]2; scMGATGRN is a kind of method based on deep
learning model, which uses a multiview graph attention network to
extract local feature information and high-order neighbor feature
information of nodes in the GRN^[56]15. Transcriptome data cannot
directly reflect TF protein abundance and DNA binding events,
interactions between TFs and cofactors, etc. Therefore, methods based
on multi-omics data integrate additional information during the
construction of CSN to improve the accuracy of network construction.
For example, Li et al. proposed a GRN construction method based on
multi-omics data such as transcriptome and epigenomics data^[57]16;
LINGER uses a neural network to infer GRN based multiome data data,
whcih incorporates atlas-scale external bulk data across diverse
cellular contexts and prior knowledge of transcription factor
motifs^[58]17. Note that it is difficult to obtain other omics (such as
scATAC-seq, snmC-seq, and lncRNA-data) that can be paired with
scRNA-seq data, and there may be significant technical and biological
noise between different omics of single-cell sequencing data.
Therefore, the construction of GRN based on single-cell multi-omics
data has encountered certain challenges.
In order to address such challenges, we proposed a cell type and cell
state specific GRN inference method called inferCSN based on scRNA-seq
data.
Figure [59]1 shows the main workflow of inferCSN, which includes:
preprocessing scRNA-seq data (Fig. [60]1a), then combining pseudo time
inference and GAMs to obtain the gene expression matrix containing cell
annotation information and pseudo time information (Fig. [61]1b);
Calculating the intersection point between two cell states based on the
density of pseudo temporal information and then dividing all cells into
multiple windows based on the intersection point; Constructing GRN in
each window based on L0 and L2 regularization model (Fig. [62]1c, d);
Building a reference network and using it to calibrate the cell type
and state specific GRN (Fig. [63]1e–g). Experimental results on
multiple simulated and real datasets show that: Firstly, the inferCSN
has significant advantages in accuracy, efficiency, and robustness;
Secondly, it can construct the GRN of T cells in different states
within the tumor microenvironment at the single-cell level, and compare
the regulatory networks in different states can reveal immune
suppression related signaling pathways; Thirdly, it can construct the
GRNs of different tumor subclons and comparing the regulatory networks
to identify the immune escape pathways dominated by different subclons,
which helps to achieve precise immunotherapy. Finally, inferCSN can
help investigate the development trend of immunosuppressive molecules
of immune cells and tumor cells and test their correlations as the
tumor evolves.
Fig. 1. The workflow of inferCSN.
[64]Fig. 1
[65]Open in a new tab
a Preprocessing of scRNA-seq data. b Pseudotime inference and
generalized additive model are used to analyze the preprocessed gene
expression data to obtain a gene expression matrix containing cell
annotation information and pseudotime information. c, d The
intersection of two different cell states is calculated based on the
density of pseudo-time information, and all cells are divided into
multiple windows based on these intersections. In each window, a sparse
regression model with L0 and L2 regularization is used to construct a
gene regulatory network (GRN). e–g Reference networks are constructed
and used to calibrate cell type and state specific gene regulatory
networks.
Results
To demonstrate the superiority of inferCSN, we conducted different
experimental analyses from four perspectives: (1) comparing it with
other methods to verify that our method is more accurate, efficient,
and biologically interpretable in GRN construction; (2) we
distinguished the different states of T cells and constructed GRNs
corresponding to different states. Then, we compared GRNs of different
states to explore whether inferCSN helps to discover signaling pathways
related to immune suppression; (3) Subclonal differentiation was
performed on tumor cells, and GRNs corresponding to different subclons
were constructed. Then, the GRNs of different subclons were compared to
explore whether inferCSN can help discover signaling pathways related
to immune escape; (4) Simultaneously analyzing T cells and tumor cells
in patients at the same clinical stage to observe whether there is a
correlation between the two in suppressing the immune system.
Construction accuracy of inferCSN
First of all, we compared inferCNS with other popular methods, such as
GENIE3^[66]5, SINCERITIES^[67]10, PPCOR^[68]18, LEAP^[69]9, and
SCINET^[70]19, based on simulated and real datasets, respectively.
Among these methods, GENIE3, SINCERITIES, PPCOR, and LEAP only need to
provide gene expression matrix or both gene expression matrix and
pseudo time information. Therefore, the inferCSN is set as a reference
network free mode and compared with other methods on 200 simulated
datasets. Note that SCINET relies on an external database and requires
additional complex processing pipeline, so SCINET does not participate
in computational complexity evaluation. In order to obtain the optimal
AUROC and AUPRC on each dataset, all experiments were conducted on a
server equipped with AMDRyzen9 5950 × 16 CPU, 128 G RAM, and Ubuntu
20.04.5 LTS system. The GENIE3 and inferCSN were set to run on 16 CPU
cores, with other parameters defaulted; PPCOR and SCINET were tested
according to default parameters; LEAP needs a parameter maxLag, which
controls the maximum lag to be considered. According to the
recommendation (do not use values greater than 1/3), 7 values were
tested within the range of [0, 0.33]: 0.05, 0.1, 0.15, 0.2, 0.25, 0.3,
and 0.33, and the optimal parameters were selected; When inputting gene
expression data with time information, SINCERITIES does not require any
parameters. However, when using pseudo time information, according to
the recommendation, the pseudo time values are discretely distributed
to fixed windows, so the size of windows is set to be 5, 6, 7, 8, 9,
10, 15, and 20.
The results on simulated dataset: Table [71]1 shows the average AUROC
of these five methods on different types of simulated datasets, such as
Bifurcating (BF), Cycle (CY), Linear (LI) and long Linear (LL). On the
BF dataset, the average AUROC of inferCSN is slightly lower than PPCOR,
and higher than all other methods on other datasets. “ALL” in Table
[72]1 represents the average AUROC on all 200 simulated datasets.
Table 1.
The AUROC on different types of datasets
Datasets inferCSN GENIE3 SINCERITIES PPCOR LEAP
ALL 0.811 0.792 0.518 0.683 0.661
BF 0.649 0.655 0.499 0.661 0.577
CY 0.774 0.738 0.569 0.464 0.516
LI 0.878 0.841 0.497 0.772 0.711
LL 0.942 0.935 0.506 0.854 0.837
[73]Open in a new tab
Bold font indicates the best performance achieved by the method on a
specific dataset compared to other methods.
The AUROC results of five methods on four types of datasets were
depicted in Fig. [74]2. Note that there is no significant difference in
AUROC among inferCSN, GENIE3, and PPCOR on the BF dataset, while
inferCSN performs significantly better than all other methods on other
datasets. The box plot can reflect the distribution characteristics and
degree of dispersion of results. In Fig. [75]2, the AUROC distribution
of inferCSN is dense on different datasets, indicating that inferCSN
has stable performance in constructing CSNs among different scales
datasets.
Fig. 2. The AUROC results of five methods.
[76]Fig. 2
[77]Open in a new tab
The AUROC results of 5 different methods are based on four different
datasets: bifurcating (BF, a), cycle (CY, b), linear (LI, c) and long
linear (LL, d). Use Wilcoxon test to evaluate the performance of these
methods, with significance levels represented by “ns” and asterisks
“*”. The “ns” represents P > 0.05, “*” represents P < 0.05, “**”
represents P < 0.01, “* * *” represents P < 0.001, “***” represents
P < 0.0001. P > 0.05 indicates that there is no significant difference
between the two methods. The smaller the P value, the more significant
difference between the two methods.
Table [78]2 and Fig. [79]3 show the AUPRC of these five methods on
different types of datasets. Except for the BF dataset, the average
AUPRC of inferCSN is higher than that of other methods. “ALL” in Table
[80]2 represents the average AUPRC on all 200 simulated datasets.
Table 2.
The Average AUPRC on different types of datasets
Datasets inferCSN GENIE3 SINCERITIES PPCOR LEAP
ALL 0.432 0.351 0.182 0.368 0.336
BF 0.359 0.328 0.263 0.455 0.307
CY 0.427 0.301 0.221 0.291 0.283
LI 0.511 0.412 0.177 0.488 0.382
LL 0.433 0.358 0.006 0.238 0.373
[81]Open in a new tab
Bold font indicates the best performance achieved by the method on a
specific dataset compared to other methods.
Fig. 3. The AUPRC results of five methods.
[82]Fig. 3
[83]Open in a new tab
The results of 5 different methods are based on four different
datasets: bifurcating (BF, a), cycle (CY, b), linear (LI, c) and long
linear (LL, d). Use Wilcoxon test to evaluate the performance of these
methods, with significance levels represented by “ns” and asterisks
“*”. The “ns” represents P > 0.05, “*” represents P < 0.05, “**”
represents P < 0.01, “* * *” represents P < 0.001, “***” represents
P < 0.0001. P > 0.05 indicates that there is no significant difference
between the two methods. The smaller the P value, the more significant
difference between the two methods.
Figure [84]2 shows that there is no significant difference in AUROC
between inferCSN and GENIE3 on the BF dataset, but their performance is
significantly better than other methods on other datasets.
Experimental results on real datasets: inferCSN and other methods were
run ten times on real datasets, and the average AUROC and AUPRC were
calculated. The results in Fig. [85]4 show that inferCSN has
significant performance advantages on real datasets.
Fig. 4. The AUROC and AUPRC results on real datasets.
[86]Fig. 4
[87]Open in a new tab
The inferCSN, GENIE3 and other methods were run 10 times on the real
datasets, and the average AUROC and AUPRC results were calculated. The
AUROC results are shown in a, and the AUPRC results are shown in b.
Robustness evaluation of inferCSN
The number of cells may have certain impact on the GRN construction
accuracy. To test the impact of cell number on accuracy, AUROC and
AUPRC were compared for each method using simulated datasets containing
100, 200, 500, 2000, and 5000 cells. 40 benchmark datasets were tested
for each cell count, and the average of AUROC and AUPRC were depicted.
Figure [88]5 shows the average AUROC and AUPRC for all methods on these
datasets. It can be observed that compared to other methods, inferCSN
achieved the highest AUROC and AUPRC on all datasets, and was able to
robustly construct CSNs on datasets of various scales.
Fig. 5. The AUROC and AUPRC of five methods on datasets with different
numbers of cell.
[89]Fig. 5
[90]Open in a new tab
The AUROC and AUPRC of each method were compared using simulated
datasets containing 100, 200, 500, 2000, and 5000 cells. AUROC and
AUPRC results are presented in panels a and b, respectively.
Computational complexity of inferCSN
The number of cells and the number of genes are the key factors
affecting the computational efficiency of the GRN construction method.
In order to analyze the effect of the number of cells on the running
time, each dataset was fixed at 3,000 genes, and the number of cells
was set to 2000, 4000, 6000, 8000, and 10,000, respectively. In order
to analyze the effect of the number of genes on the running time, each
dataset was fixed at 10,000 cells, and the number of genes was tested
to be 600, 1200, 1800, 2400, and 3000 genes. The inferCSN and GENIE3
enable CPU multithreading acceleration on 24 CPU cores. Each method was
run 10 times, resulting in an average running time as shown in Fig.
[91]6. Compared to the GENIE3 method, inferCSN is more efficient as the
number of cells and genes increases. Only some of the time points for
the GENIE3 method are shown in the figure because the GENIE3 method
takes too long to construct GRNs as the size of the dataset increases.
Although both inferCSN and SINCERITIES construct CSNs based on
regression models, there is a significant difference in the efficiency
of these two methods. The reason why the running time of SINCERITIES is
not shown is because the it takes too long to handle such large-scale
datasets. These results further validate the effectiveness of adding
the L2 regularization term to inferCSN model, which can effectively
address the high-dimensionality and sparsity characteristics of
scRNA-seq data. In addition, the inferCSN also supports specifying TFs
lists, which greatly reduces the computational complexity. The PPCOR is
a kind of univariate correlation method, and thus also has a great
advantage in terms of running time, but it is less robust, especially
when dealing with datasets of different types. To further demonstrate
the applicability of inferCSN to large-scale datasets, we validated its
accuracy and computational performance on datasets with 20,000 cells
and 5000 genes, and the results are shown in Supplementary Table [92]1.
Fig. 6. Running time of five methods on datasets with different cell and gene
numbers.
[93]Fig. 6
[94]Open in a new tab
a each dataset is fixed to 3000 genes, and the number of cells is 2000,
4000, 6000, 8000, and 10,000. b each dataset is fixed to 10,000 cells,
and the number of test genes is 600, 1200, 1800, 2400, and 3000. Each
method was run 10 times, and the average running time was calculated to
obtain the results.
In sum, inferCSN exhibits the best overall performance. Even without
using a reference network, inferCSN still has satisfactory performance.
It is worth noting that SINCERITIES performs poorly in terms of AUROC,
AUPRC, and running time. The main reason may be that the accuracy of
SINCERITIES heavily relies on the accuracy of pseudo time information.
Although the inferCSN also utilizes pseudo time information when
constructing CSN, the strategy of dividing time windows overcomes the
information difference caused by pseudo time density on one hand. On
the other hand, it also reduces the requirement for the quality of
pseudo temporal information, thereby improving the robustness of
inferCSN.
Interpretability of inferCSN
We selected a simulated dataset containing 18 genes and calculated the
AUROC and AUPRC values for each method. Except for SINCERITIES, which
had the lowest AUROC and AUPRC values, all other methods performed
well, as shown in Table [95]3. However, indicators such as AUROC and
AUPRC cannot fully reflect the advantages and disadvantages of the CSN
inference method.
Table 3.
The AUROC and AUPRC results on simulated datasets with 18 genes
inferCSN GENIE3 SINCERITIES PPCOR LEAP
AUROC 0.963 0.945 0.429 0.924 0.886
AUPRC 0.495 0.338 0.046 0.290 0.409
[96]Open in a new tab
Consequently, the networks obtained by different inference methods will
be compared with the gold standard network. In Fig. [97]7, the results
of PPCOR and LEAP differ significantly from the gold standard. GENIE3
and inferCSN perform similarly on AUROC and other indicators, but it
can be seen from Fig. [98]7 that inferCSN constructs a better network.
For example, inferCSN can recognize g1 → g18 as negative regulation,
but GENIE3 mistakenly identifies it as positive regulation. In Fig.
[99]7, it can be observed that the regulatory effect of g18 → g1 does
not exist, but inferCSN identifies it as a negative regulatory effect.
Although inferCSN identifies some incorrect regulatory relationships,
it still has advantages over other methods overall.
Fig. 7. The heat maps of gold network and networks constructed by different
methods.
[100]Fig. 7
[101]Open in a new tab
The Gold is the heat map of the gold standard network. The networks
obtained by other different inference methods (such as inferCSN,
GENIE3, etc.) are compared with the gold standard network.
Both SINCERITIES and inferCSN use regression models, but their
performance shows significant differences. The advantage of inferCSN is
that it does not require complex parameter settings, while SINCERITIES
requires a lot of parameter adjustments, and different parameter
settings have a significant impact on performance. Moreover,
SINCERITIES has very high requirements for the quality of pseudo time
information, and as shown in Fig. [102]7, the network constructed by
SINCERITIES differs significantly from the gold standard network.
Dynamic regulatory network inference and immune regulation analysis of T
cells
Previous studies have shown heterogeneity in infiltrating CD8^+T cells
in the tumor microenvironment (TME), including immature T cells
characterized by CCR7, LEF1, and TCF7, cytotoxic effector T cells
characterized by CX3CR1, PRF1, and KLRG1, and dysfunctional T
cells^[103]20. The obvious feature of dysfunctional T cells is an
increase in coinhibitory receptors, such as PD1, LAG3, TIM3, CTLA4, and
TIGIT. Compared with effector CD8 + T cells, their effector function is
limited. Dysfunctional T cells can be further divided into subgroups,
such as pre dysfunctional, progenitor depleted, and ultimately depleted
T cells.
The state transition of CD8^+T cells in TME during dysfunction is
jointly regulated by TFs and their related cofactors. In order to gain
a deeper understanding of the regulatory mechanisms driving the
transition of CD8^+T cells from immature T cells to dysfunctional T
cells, inferCSN was used to construct CSN on 2508 T cells in different
states. Firstly, we inferred the pseudo temporal information and
combined it with the cell state to divide these T cells into four
windows, and then each window is mainly composed of cells between the
intersection points of two time densities, as shown in Fig. [104]8.
Among them, Window 1 (W1) is mainly composed of initial and
intermediate state cells, W2 is mainly composed of intermediate state
and GZMK marked prefunctional cells, W3 is mainly composed of GZMK
characterized prefunctional cells and ZNF683 prefunctional cells, and
W4 is mainly composed of ZNF683 prefunctional cells and dysfunction
cells.
Fig. 8. The CD8 + T cell window divisions.
[105]Fig. 8
[106]Open in a new tab
Window 1 (W1) was mainly composed of cells in the initial and
intermediate states, W2 was mainly composed of intermediate states and
GZMK-labeled prefunctional cells, W3 was mainly composed of GZMK marked
prefunctional cells and ZNF683 prefunctional cells, and W4 was mainly
composed of ZNF683 prefunctional cells and dysfunctional cells.
Then, we constructed CSNs for each window separately to analyze the
regulatory relationship driving the transition of cell states. After
constructing CSN for each window, a systematic comparison was conducted
on the TFs and gene nodes of CSNs. The results show that there is a
significant overlap of nodes between the CSNs of each window, as shown
in Fig. [107]9. Figure [108]9a shows the overlap status of TFs nodes in
each window, Fig. [109]9b shows the overlap status of gene nodes in
each window, and the red bar chart shows the common nodes of CSN
corresponding to these four windows.
Fig. 9. Overlapping status of TFs and gene nodes in each window.
[110]Fig. 9
[111]Open in a new tab
a the overlap of TFs nodes in each window. b the overlap of gene nodes
in each window.
The Jaccard coefficient (Jaccard ∈ [0,1]) is used to measure the
similarity between two networks. A larger Jaccard coefficient indicates
that the two CSNs are more similar, that is, there are more identical
regulatory relationships in these two networks. We calculated the
average Jaccard coefficient between the CSNs of each window, as shown
in Table [112]4. The results indicate that although the CSNs of the
four windows have a high proportion of duplicate nodes, the average
Jaccard coefficient of all possible intersecting windows is only 0.165.
It means that the regulation relationship in CSNs of different windows
is significantly different, and a large number of reconnections have
taken place in these same nodes, that is, TFs basically reconnect to
different targets along the state transition process of T cells, and
the same TFs dynamically regulate different genes, which further
indicates that there is a complex interaction relationship between
immune cells and tumor cells in TME.
Table 4.
The Jaccard coefficients between different windows
Window Jaccard P value
W1_W2 0.127 2.67e-10
W1_W3 0.153 8.09e-06
W1_W4 0.230 8.03e-09
W2_W3 0.166 0.00e + 00
W2_W4 0.151 5.80e-08
W3_W4 0.164 2.25e-24
[113]Open in a new tab
Under different cellular states, reconnecting regulatory relationships
may alter the network centrality of genes in CSN. Hub genes with high
centrality interact with many other genes in specific cellular
environment, therefore, it is assumed that they are more likely to play
an important role in maintaining specific functions of specific cell
types. Next, we used indicators such as degree, proximity, betweenness,
eigenvalues, and PageRank to evaluate the centrality of nodes. Finally,
the top 60 nodes with the highest ranking were selected as key TFs. The
Mfuzz clustering algorithm is commonly used to identify the dynamic
expression patterns of RNA-seq data with temporal characteristics, and
link the dynamic expression pattern features with gene
function^[114]21. In this study, the Mfuzz clustering algorithm was
applied to cluster the expression patterns of these 60 key TFs using
the default parameter settings. By default, Mfuzz outputs clusters with
the same temporal features, where the temporal features are divided
into four windows. Finally, these 60 TFs are divided into four clusters
with different expression patterns (Fig. [115]10a), as detailed in
Table [116]5.
Fig. 10. Expression patterns and enrichment analysis results of TFs.
[117]Fig. 10
[118]Open in a new tab
a The result of Mfuzz clustering. b The expression pattern of TFs in
Cluster 1 shows an overall upward trend as the cell state changes. c
The pathways enriched by key TFs.
Table 5.
The key transcription factors
CLUSTERS THE CODING GENE OF TF
CLUSTER1 ZNF48, MXD3, KLF13, ID2, SP100, MAF, EOMES, ETV1, MSC, ZNF683,
STAT1, CREM, FOSB, KLF10, HMGA1, TSC22D3, LYAR, IKZF3, HIVEP3, IRF1
CLUSTER2 PHTF2, RBPJ, FOS, KLF6, PHTF1, TRERF1, BATF, IRF7, BCL11B,
TOX, LRRFIP1, TRPS1
CLUSTER3 ZGPAT, IKZF1, HIVEP2, HIF1A, ZNF91, TBX21, RUNX2, STAT4
CLUSTER4 CREB3L2, KLF12, ETS1, HOPX, BHLHE40, ARID4B, ZNF83, NCOR1,
KDM5B, STAT3, PRDM1, ATF6B, ZNF276, ZBTB20, FOXP1, RUNX3, KAT7, NFATC2,
JUNB, SP140
[119]Open in a new tab
According to the results of Mfuzz clustering, the expression patterns
of TFs in Cluster1 showed an overall upward trend with cell state
transition, as shown in Fig. [120]10a, [121]b.
KEGG enrichment analysis was performed on these key TFs, as shown in
Fig. [122]10c, where “n” represents the number of genes in this
cluster. These TFs in Cluster1 mainly involve pathways related to tumor
development, differentiation, and transcription, such as
transcriptional dysregulation in cancer and Th1 and Th2 cell
differentiation pathways, which suggests that the role of these TFs in
T cell state transition may be related to tumor immune escape. Studies
have shown that ZNF48 induces replication stress^[123]22, which is one
of the characteristics of many invasive tumors. ID2 is an important TF
involved in the formation of terminal differentiated T cells^[124]23.
In summary, these results indicate that these TFs play an important
role in the state transition of T cells, and inferCSN effectively
reveals the key TFs and their pathways in T cell state transition,
which is of great significance for a deeper understanding of T cell
function and the regulatory mechanisms of the immune system.
In order to further understand the regulatory mechanism in the process
of T cell dysfunction and more clearly compare the possible regulatory
role of regulatory factors in cell state transition. We visualized the
CSNs derived by inferCSN for 20 key TFs in Cluster 1 across different
windows (Fig. [125]11a–d, corresponding to Windows 1, 2, 3, and 4,
respectively).
Fig. 11. CSNs of 20 key TFs.
[126]Fig. 11
[127]Open in a new tab
“Activation” on the legend indicates that the TFs promote the
expression level of the target gene, while “Repression” indicates that
the TFs inhibit the expression level of the target gene. a–d correspond
to windows W1, W2, W3 and W4 respectively. The expression of 20 key TFs
in Cluster1 in different windows derived from CSN by inferCSN.
From Fig. [128]11, it can be clearly seen that as the cell state
changes, the number of TFs and genes increases in different windows of
CSN, resulting in more complex regulatory relationships.
Dynamic regulatory network inference and immune escape analysis of tumor
cells
Similar to the T cell analysis, inferCSN was used to construct GRNs and
analyze immune escape pathways on 7 tumor cell datasets. For example,
2243 tumor cells were screened from [129]GSE127465, including 486
Alveolar cells, 48 CXCL1 Cancer cells, 640 Pathological Alveolar cells,
46 Proliferating Cancer cells, and 1023 SOX2 Cancer cells. Then,
inferCSN divided these 2243 cells into 4 windows (Supplementary Fig.
[130]1).
Finally, we constructed CSNs for each window to discover key changes in
regulatory relationships between adjacent tumor cell states. By
comparing the TFs and gene nodes of CSNs in various windows, we found
that there is a significant overlap of nodes between the CSNs in each
window, as shown in Supplementary Fig. [131]2.
The results in Supplementary Table [132]2 indicate that some of the
CSNs in the four windows have a high proportion of duplicate nodes, and
the edges between the same nodes undergo a significant amount of
reconnection. This means that during tumor cell state transition, TFs
reconnect to different targets, and the same group of TFs dynamically
regulates different genes. However, starting from window 3, although
the Jaccard coefficients of CSNs in different windows are relatively
large, the P-value is also high, indicating that some false positive
edges may have appeared from window 3. Therefore, it is necessary to
add more additional information to filter the network edges.
Similarly, in order to identify the nodes that play a crucial role in
CSNs, the importance of the nodes is evaluated through metrics such as
PageRank value and node centrality. Finally, the top 10 nodes with
higher rankings were selected as key TFs, and the expression of these
TFs in different windows was depicted in Supplementary Fig. [133]3.
In order to further understand the regulatory mechanisms driving the
development of tumor cells and the regulatory roles of TFs during tumor
cell state transition, we conducted KEGG enrichment analysis on these
10 key TFs (Supplementary Fig. [134]3) and found that these TFs are
involved in some signaling pathways of NSCLC, such as TNF signaling
pathway, GnRH signaling pathway, IL-17 signaling pathway, and PD-L1
checkpoint signal, indicating that these TFs may be related to tumor
immune response. For example, EGR1 plays a crucial role in regulating
cell growth, differentiation, survival, and immune response^[135]24;
NBKBIA^[136]25 regulates the NF kB signaling pathway, controlling cell
apoptosis, inflammatory response, immune response, etc.; YBX1 is an RNA
binding protein that participates in tumorigenesis, regulates a series
of transcription factors, and promotes the tumorigenesis of
NSCLC^[137]26.
Then, some network edges were filtered based on parameters such as edge
weights, and the network connectivity of these TFs was visualized using
inferCSN (Supplementary Fig. [138]4). Stress protein ATF4^[139]27 can
promote tumor cell growth through fibroblasts, which may be a good
target for cancer treatment. ATF4 can regulate the expression of Col1a1
and collagen biosynthesis in perivascular cancer related fibroblasts,
thus supporting angiogenesis and progression of melanoma and pancreatic
cancer. We found that the network edge weight of ATF4 and COL1A1 is
getting less and less in four different windows; SFPQ^[140]20 can bind
to the TXNIP promoter, causing transcriptional inhibition of TXNIP. It
can be observed that the network edge weights between SFPQ and TXNIP
are decreasing in different windows of the network, indicating that
SFPQ may inhibit the transcription of TXNIP.
Correlation analysis of immune regulation between T cells and tumor cells
In order to further understand the immune regulatory mechanism between
T cells and tumor cells, the inferCSN was used to construct CSNs for
tumor cells and immune cells in the same clinical stage. Patients in
the [141]GSE148071 dataset were all in clinical stage III/IV, so
inferCSN was used to further analyze the [142]GSE148071 dataset with a
total of 51912 cells. Because it is difficult to distinguish normal
epithelial cells from malignant tumor cells using gene markers, copy
number variation (CNV) has been applied to identify tumor
cells^[143]28. Firstly, CNV analysis was performed on the dataset using
the CopyKat^[144]29 package. The advantage of using the CopyKat package
is that it can calculate the genome copy variation of a single cell
without setting up a normal cell control group, thereby predicting
whether it is a normal cell (diploid) or a tumor cell (aneuploid).
After that, a total of 4543 tumor cells were screened, including 245
CXCL1 cancer cells, 7 CDKN2A cancer cells, 6 LAMC2 cancer cells, 1083
Proliferating cancer cells, and 3202 SOX2 cancer cells. Next, inferCSN
constructs CSNs of 4543 tumor cells. The inferCSN adopts
Monocle3^[145]8 for pseudo time inference, and combines different types
of tumor cell differentiation trajectories and pseudo time information
to divide these tumor cells into four consecutive windows, as shown in
Supplementary Fig. [146]5. Meanwhile, 470 CD8 + T cells were recognized
from [147]GSE148071.
The changes between the networks of the consecutive windows may help to
identify the key pathway by which normal epithelial cells transform
into cancer cells. In LUAD, the proportion of alveolar cells and
pathological alveolar cells is much higher than in LUSC. LUAD samples
also have a high proportion of LAMC2 and CXCL1 cells. The proportions
of CDKN2A and SOX2 cancer cells in LUSC samples are higher^[148]30.
Compared with LUAD specific cell types (alveolar carcinoma,
pathological alveolar carcinoma, CXCL1 carcinoma), the more abundant
cell types (SOX2, CDKN2A, proliferative carcinoma) in LUSC lag behind
in pseudo time, suggesting that LUAD promotes the stem cell like
phenotype of LUSC cells, so that LUSC cells have stronger invasiveness.
Next, these CSNs are constructed for each window. Then, metrics such as
PageRank and point centrality were used to evaluate the importance of
nodes and identify TFs nodes that play a critical role in CSN. Finally,
13 key TFs were selected, and their expression of these key nodes in
different windows over time was depicted in Supplementary Fig. [149]6.
For example, TFs such as ATF3, FOS, and HES1 are highly expressed in
Window1, and their expression levels also decrease with the change of
windows. However, TFs such as YBX1, SOX2, and SMARCA5 are the
opposite^[150]31,[151]32.
In order to further understand the regulatory mechanisms involved in
the evolution of tumor cells and to more clearly compare the regulatory
roles that TFs may play during tumor cell state transitions, we
examined the regulatory relationships between TFs and genes in each
window, as shown in Supplementary Fig. [152]7.
Further KEGG pathway enrichment analysis was conducted on important
genes in each window (Fig. [153]12), and it was found that some
pathways enriched in different windows were common. For example,
pathways such as cell apoptosis and chemical carcinogenesis indicate
that these pathways are the infrastructure for tumor development and
are involved in different stages of tumor cells. Other pathways,
including TNF signaling pathway (w2, w3, w4), PD-L1 expression in
tumors, PD-1 checkpoint pathway in cancer (w1, w4), and MAPK signaling
pathway (w1, w3), are only enriched in specific windows. These pathways
are all closely related to tumor development^[154]25,[155]33.
Fig. 12. The KEGG enrichment analysis results in different windows of tumor
cells in [156]GSE148071.
[157]Fig. 12
[158]Open in a new tab
The horizontal axis is -LOG10 (FDR). The larger the value, the more
significant the difference, indicating that there are more genes
enriched in the pathway.
T cells are the main target of immunotherapy for NSCLC^[159]34. Based
on the current understanding of CD8^+T cell differentiation, activated
naïve T cells differentiate into different effector T cells and memory
T cells. In tumors, chronic T cell stimulation leads to
dysdifferentiation and functional exhaustion, characterized by loss of
effector function and inhibition of receptor expression^[160]35. Next,
CD8^+T cells were first screened, and the cell status of CD8^+T was
analyzed and calculated using the ProjecTILs^[161]36 package, resulting
in 470 CD8^+T cells in different states. Finally, 35 central memory T
cells (CD8. CM), 122 effector memory T cells (CD8. EM), 21 depleted
precursor T cells (CD8. TPEX), and 288 depleted terminal T cells (CD8.
TEX) were selected for CSN inference of T cells. Then, the pseudo time
values of each CD8^+T cell state were calculated using Monocle3 and
divided into three windows as shown in Supplementary Fig. [162]8.
Use inferCSN to construct CSNs for T cells in each window. Due to the
relatively small number of T cell datasets, only three TFs (JUN,
NFKBIA, ID2) were derived, and their expression patterns in different
windows with time variation were shown in Supplementary Fig. [163]9.
The JUN gene is closely related to metabolism, cell proliferation, and
apoptosis. ID2 plays an important role in regulating gene expression
and response intensity in CD8 + T cells^[164]37.
Supplementary Fig. [165]10 shows that as the window changes, the weight
relationship between the CD69 gene and TFs in CSN gradually decreases,
and the number of TFs that are regulated by CD69 also decreases with
the change of the window. Only two TFs in Supplementary Fig. [166]10c
are regulated by CD69. In addition, it can be seen from Supplementary
Fig. [167]10 that as the pseudo time changes, the number of effector T
cells and memory T cells gradually decreases, indirectly indicating a
decrease in the expression level of CD69.
From Fig. [168]13, it can be observed that as pseudo time passes, the
expression of DUSP1 gene in T cells is positively correlated with JUN
and NFKBIA, but the expression level gradually decreases, while it is
negatively correlated with the expression of ID2. DUSP1 is abnormally
expressed in various human tumors and is associated with the prognosis
of tumor patients. DUSP1 has also been found to play a role in tumor
chemotherapy, immunotherapy, and biological therapy. However, the role
of DUSP1 in different tumor carcinogenesis may be
controversial^[169]38,[170]39. The relationship between the three TFs
and DUSP1 revealed in this study may provide new perspectives for
understanding the tumor immune escape of NSCLC.
Fig. 13. The changing correlations between DUSP1 and 3 TFs.
Fig. 13
[171]Open in a new tab
The vertical axis represents the weight, and the horizontal axis W1,
W2, and W3 represent different windows, showing the weight changes of
three different gene pairs (JUN-DUSP1, ID2-DUSP1, and NFKBIA-DUSP1) in
three different windows.
Finally, enrichment analysis was performed on CSNs from different
windows (Supplementary Fig. [172]11), and it was found that key
pathways related to T cell differentiation and cancer evolution, such
as Th1 and Th2 cell differentiation, T cell receptor signaling
pathways, antigen processing and presentation, Th17 cell
differentiation, and cancer pathways, varied in enrichment at different
windows, indicating changes in T cell state transition.
More importantly, comparing the regulatory analysis results of T cells
and tumor cells, it can be found that as the tumor progresses, there is
a correlation in the immune regulation between T cells and tumor cells.
For example, the two contain some similar signaling pathways, including
the pathways in cancer, T cell differentiation pathway, B cell
differentiation pathway, and cell apoptosis pathway. But the difference
is that the enrichment levels of these pathways are not consistent
between the two (Fig. [173]12, Supplementary Fig. [174]11). The
enrichment degree of pathways in tumor cells in Fig. [175]12 is the
most significant, but in T cells in Supplementary Fig. [176]11, the
enrichment degree is not significant. However, the enrichment degree of
them gradually increases in both T cells and tumor cells as the window
changes.
The tumor cell enrichment analysis in Fig. [177]12 also includes some
immune related signaling pathways such as T cell receptor signaling
pathway and B cell receptor signaling pathway. For the T cell receptor
signaling pathway, its enrichment degree gradually decreases with the
change of window in T cells. The enrichment of T cell signaling
pathways in w1 and w2 is decreasing in tumor cells, but it is
increasing in w3 and w4. These pathways may play different and
important roles in LUAD and LUSC. IL-17 is a key pro-inflammatory
cytokine primarily produced by Th17 cells, highly expressed in patients
with various autoimmune diseases. It is a major pathological cytokine
and therapeutic target for many autoimmune diseases and cancers. In
this study, as T cells developed in different states, the degree of
IL-17 pathway enrichment gradually increased. At present, research has
found that the IL-17 family genes play a promoting role in lung cancer,
but the expression and function of IL-17A and IL-17B in lung cancer
have not been fully elucidated, and larger sample size studies are
needed^[178]40. The enrichment analysis results of different windows of
tumor cells in Fig. [179]12 show that the degree of IL-17 signaling
pathway enrichment gradually increases from w2, indicating that it
plays a promoting role in tumor differentiation. For w1, its signaling
pathway has the highest significance, which means that the expression
level of IL-17 family mRNA may be related to specific NSCLC
histological types. For example, the upregulation frequency of IL-17,
IL-17 b, and IL-17c mRNA in LUSC is lower than that in LUAD^[180]40.
Therefore, it is necessary to pay attention to the relationship between
IL-17 and the progression of NSCLC.
In addition, based on the CSNs of tumor cell and T cell, we found that
some TFs and genes play important roles together, and the presence of
these genes is also found in some common signaling pathways. For
example, the NFKBIA is not only a highly expressed immune checkpoint
related gene in effector T cells, but also the NFKBIA signaling pathway
may play an important role in the progression of tumor cells^[181]41.
Discussion
The inferCSN method proposed in this study outperforms current
mainstream methods in multiple metrics. The comprehensive experimental
results on simulated and real datasets show that inferCSN outperforms
GENIE3 and other methods in accuracy metrics such as AUROC; The
scRNA-seq data usually contains tens of thousands of cells, and
traditional methods are either difficult to process directly or require
a heavy computational burden. Our inferCSN shows significant
improvement in computational efficiency. Compared with the gold
standard network, inferCSN also has certain advantages in terms of
network interpretability and stability. The main reasons why the
inferCSN method can achieve the above advantages are:
1. The inferCSN uses the L0 regularization method to identify the
correlation between target genes and TFs. Experiments have shown
that the L0 regularization method can not only effectively identify
high-order regulatory relationships between multiple TFs and target
genes, but also avoid artificially setting thresholds, which helps
to reduce false positives and significantly improve the accuracy
and sparsity of network inference.
2. Pseudo time trajectories can simulate the dynamic differentiation
process of cells, including temporal changes in gene expression.
Fully utilizing such a dynamic differentiation process can improve
the accuracy and biological significance of regulatory relationship
recognition.
3. After distinguishing cells of different types and states, the
Gaussian kernel density estimation method is used to combine cell
state and pseudo time information, dividing cells into different
windows to eliminate information differences caused by pseudo time
density. The inferCSN is no longer a static model, but a dynamic
CSN inference method, which helps to examine the changing trend of
regulatory network as tumors occur and develop. In addition, the
CSN is specific to cell type and state.
4. The regulatory relationship constructed solely based on the trend
of changes in TFs and target genes may lack biological
significance. Introducing prior information such as reference
networks can help improve the interpretability of the network.
Therefore, the inferCSN method can be used to analyze the GRN of immune
cells in different states within the tumor microenvironment at the
single-cell level, compare the differences in regulatory networks under
different T cell states, and discover immune suppression related
signaling pathways. Moreover, this method can compare the GRNs of
different tumor subclons to identify the immune escape pathways
dominated by different subclons, which helps to achieve precise
immunotherapy.
Although inferCSN has achieved certain advantages, it still needs
improvement. For example, the edges between nodes in CSN are
established based on statistical relationships, but in fact, even if
the expression level of TFs remains unchanged, the expression level of
target genes may also change. Although external reference information
is embedded in inferCSN, the construction method of the reference
network may still result in some false positive edges in the regulatory
network. Therefore, it is still necessary to reasonably integrate other
types of data (such as ATAC-seq data) for CSN inference methods.
Methods
Simulated and real datasets
We collected 24 NSCLC patients’ scRNA-seq dataset (The 10 NSCLC
patients were downloaded from Philip et al.’s study^[182]42, and
another 14 NSCLC patients’ were downloaded from the GEO
database^[183]43) and 200 benchmark scRNA-seq datasets to evaluate the
performance of inferCSN^[184]14. The logarithmic transformation on
fragments per kilobase million (FPKM) values were calculated, and genes
expressed in less than 10% of cells were filtered out. 800 protein
coding genes with the highest average expression value were selected
for constructing CSNs. The preprocessed dataset contains 42,369 cells
from 6 cell types, including 8288 tumor cells, 13,640 conventional T
cells, 6086 natural killer cells (NK), 5738 CD8 + T cells, 2531 plasma
cells, and 6068 B cells.
We used inferCSN to construct regulatory networks of T cells in
different states, in order to demonstrate its practical application in
discovering pathways related to T cell immune function inhibition. In
this study, preprocessing techniques were performed on 14 NSCLC
patients downloaded from the GEO database, such as extracting the count
expression matrix of CD8 + T cells, removing genes with average counts
less than 1, normalizing the counting matrix, and performing
logarithmic transformation. Finally, 12,306 protein coding genes and
2508 CD8 + T cells were obtained, including 303 naive cells, 206
intermediate cells, 674 GZMK labeled pre-dysfunctional cells 832 ZNF683
labeled pre-dysfunctional cells and 439 dysfunctional cells spanning
six different CD8 + T cell states.
Moreover, we also used inferCSN to construct regulatory networks of
tumor cells in different subclones, in order to demonstrate its
practical application in discovering pathways related to tumor immune
escape. Here, we collected a total of 224,611 cells from 7 publicly
available datasets: GSE13190728, GSE13624640, GSE14807142, GSE15393544,
[185]GSE127465, GSE11991149, and KU_ Loom
([186]https://gbiomed.kuleuven.be/scrnaseq-nsclc). After preprocessing,
there were 3385 alveolar cells, 1462 CXCL1 cancer cells, 345 LAMC2
cancer cells, 3017 pathological alveolar cells, 279 proliferating
cancer cells, and 2169 SOX2 cancer cells.
The 200 benchmark scRNA-seq datasets was generated by the Boolean
differential equation simulation^[187]14 proposed by Praapa et al.,
which includes four different types of time trajectory information,
such as Bifurcating (BF), Cycle (CY), Linear (LI) and long Linear (LL).
Each kind of time trajectory contains 50 datasets covering different
cell counts (2000, 4000, 6000, 8000 and 10,000).
Experimental configuration and evaluation metrics
For performance evaluation with benchmark networks, simulated scRNA-seq
datasets are generated by specifying regulatory relationships. The
performance evaluation on the real scRNA-seq dataset used 260,962 gold
standard positive gene pairs provided by the HumanNet database as the
benchmark network^[188]44.
We compare the inferred CSN with the reference benchmark network, and
count the number of true positive (TP), true negative (TN), false
positive (FP), and false negative (FN) edges to calculate metrics such
as true positive rate (TPR, Formula 1), false positive rate (FPR,
Formula 2), Precision (Formula 3), and Recall (Formula 4).
[MATH:
TPR=T
mi>PTP+FN
:MATH]
1
[MATH:
FPR=FPTN+FP
:MATH]
2
[MATH:
Precision=<
mi>TPTP+FP :MATH]
3
[MATH:
Recall=
TPTP+FN :MATH]
4
Calculate the area under receiver operating characteristic curve
(AUROC) using FPR as the horizontal axis and TPR as the vertical axis,
and calculate the area under precision recall curve (AUPRC) using
Precision and Recall. When calculating AUROC and AUPRC, only the
regulatory direction of the TF-gene regulatory relationship was
considered, without considering whether the regulatory relationship was
activated or inhibited. The higher the AUROC and AUPRC, the higher the
accuracy of the method in constructing the CSN.
Combining pseudo time inference and generalized additive models
Monocle3^[189]8 aims to help researchers understand the biological
processes at the single-cell level, including cell development, state
transition, and cell type recognition. Monocle3 adopts uniform manifold
approximation and projection (UMAP) for dimensionality reduction, then
divides cells into discontinuous trajectories using the PAGA algorithm,
and finally calculates the pseudo time. Slingshot is also a pseudo
temporal inference method^[190]45, which applies principal component
analysis (PCA) by default, and then uses the minimum spanning tree
method to infer the hierarchical structure of cell clusters and align
cells in pseudo time. Because inferCSN requires temporal information
reflecting the dynamic differentiation of cells (Fig. [191]1a),
Slingshot and Monocle3 were integrated into inferCSN for inferring the
pseudo time, and then cells were sorted based on the obtained pseudo
time information^[192]46. Meanwhile, we selected the branch with a
higher number of cells for subsequent window partition from the pseudo
time results.
The generalized additive model (GAM) is a regression model that can
eliminate the non-linear relationship between the dependent variable
and the independent variable by smoothing the independent variable. It
uses a smooth spline function to capture the relationship between gene
expression and time, in order to identify dynamic genes related to
temporal changes. The GAM is formulated as
[MATH: yi=β0+
mo>f(ti)+<
/mo>ϵi :MATH]
, where
[MATH: yi :MATH]
is the expression value of the gene,
[MATH: ti :MATH]
is its corresponding pseudo time,
[MATH: f(ti) :MATH]
is the nonlinear smoothing function, and
[MATH: ϵi :MATH]
is the error term following a normal distribution. The gam package
implements the GAM, and the smoothness of the model is automatically
determined by Generalized Cross Validation, and default parameters are
used in this study. The gam package can simultaneously calculate the
residual variance of the GAM model and use F-test to evaluate the
statistical significance of the model, thereby obtaining the P value of
each gene. To control for false positives introduced by multiple
hypothesis testing, we used the Bonferroni method to correct for P
values and ultimately selected genes with adjusted P values less than
0.0001 as dynamic genes.
Sliding windows
It is necessary to use sliding windows for processing temporal data,
otherwise critical time points may be missed. Meanwhile, since the
distribution of cells in pseudo temporal information is not uniform,
the regulatory relationship tends to lean towards the high-density
regions of cells. To address such an issue, assuming that there are
different cell states in each cell type, Gaussian kernel density
estimation method is used to calculate the density of different cell
states based on the pseudo time information. The basic idea of kernel
density estimation is to apply a kernel function to fit each data
point, and then estimate the probability density function by weighted
averaging these kernel functions. The Gaussian kernel density
estimation (Formula 5) uses a Gaussian kernel function (Formula 6) for
each data point.
[MATH: fx=1n∑
i=1nKh(x−x
i) :MATH]
5
where n is the size of the dataset,
[MATH: xi
:MATH]
is the i-th data point, and K[h](x-x[i]) is a Gaussian kernel function
centered on x[i] with a bandwidth of h.
[MATH: Kx=12πσe−
x22σ2
mrow> :MATH]
6
where x is the distance between the data point and a given center
point, and
[MATH: σ :MATH]
is the bandwidth parameter of the kernel function, which determines the
width of the kernel function and thus affects the smoothness of the
estimated probability density.
Then, we calculate the intersection point between the density curves of
two adjacent cell states. If there are multiple intersection points
between the two cell states, use the intersection point at the highest
density point as the benchmark to redraw the boundary and set sliding
windows with variable width (Fig. [193]1c). Consequently, the interval
between the two consecutive intersection points is considered as a time
window. Finally, all cells are divided into multiple windows, each
consisting of cells with continuous states. The density function
implemented in R package stats achieves the Gaussian kernel density
calculation, and default parameters are used.
Construction of gene regulatory network based on L0 regularization
Hazimeh et al. proposed a combination optimization algorithm using
local combinatorial optimization and cyclic coordinate descent^[194]47,
which effectively solved the L0 regularization problem. In this study,
the combination of L0 and L2 regularization term, known as the L0L2
sparse regression model (Formula 7), can be applied to overcome the
characteristics of high dimensionality, high sparsity, and low
signal-to-noise ratio of scRNA-seq data.
[MATH:
argminβ
∈Rp
12
mfrac>y−Xβ22+γ1β0+γ2<
/mrow>β22 :MATH]
7
where
[MATH: y :MATH]
represents the expression level vector of the
[MATH: i :MATH]
-th target gene in the matrix (
[MATH: Y∈Rg×
m :MATH]
,
[MATH: Y :MATH]
is a matrix of m samples and g target genes),
[MATH: X :MATH]
is a matrix of m samples and p TFs,
[MATH: β :MATH]
is a regression coefficient vector,
[MATH: γ1
:MATH]
controls the number of non-zero coefficients of TFs, and
[MATH: γ2
:MATH]
controls the amount of shrinkage caused by L2 regularization.
The L0L2 model is used to construct the CSN for each window (Fig.
[195]1d), in which the weight of the TF-gene regulation is defined as
Formula 8. Note that the n-1 cell specific regulatory network
corresponds to n-1 windows, where n represents the number of cell types
or cell states.
[MATH:
wi=<
/mo>1n−1<
/mrow>∑1n−1βi∑1p∣βj∣ :MATH]
8
where p represents the number of TFs and
[MATH: βi
mrow> :MATH]
is the absolute value of the regression coefficient of TF-gene pairs in
each window. Next, each CSN is merged with the adjacency matrix of the
reference network (Fig. [196]1e) and normalize the matrix to obtain the
CSN for each cell type (Fig. [197]1f).
Construction of reference network
The protein coding genes defined by the consensus coding sequence
database (CCDS)^[198]48 are used for constructing a reference network.
In this study, building a reference network involves the following four
steps:
Step 1: Use the Bayesian method provided in the R package SAVER^[199]49
to calculate the missing values in the scRNA-seq count matrix, and then
exclude genes with zero values greater than 99% in each cell.
Step 2: Use the NormalizeData function of the R package Seurat to
logarithmically normalize the matrix derived, then calculate the
Pearson correlation coefficients (PCC) between gene pairs, and only
retain edges with PCC > 0.8.
Step 3: Use the R package MetaCell to denoise the matrix^[200]50, and
then use the mcell_ MC_ From_ Coclust_ Balanced function with the
parameters K = 30 and alpha=2 to generate a metacell matrix, and then
remove cells with UMIs less than 500. After that, we calculate the PCC
network between gene pairs based on the metacell matrix.
Step 4: Use R package bigSCale2 to perform Z-score transformation on
the processed matrix^[201]51, and use the transformed Z-score matrix to
calculate the PCC network.
The PCC networks obtained in steps 2, 3, and 4 were merged to obtain a
reference network with 1,125,494 edges. And Bayesian models and log
likelihood score (LLS) are used to evaluate the accuracy of the
reference networ^[202]52. The LLS is calculated for the reference
network sorted by weight, as shown in Formula 9.
[MATH:
LLS=logPL∣DP¬L∣DPLP¬L :MATH]
9
where L is the gold standard positive gene pair in reference network D,
[MATH:
P(L∣D) :MATH]
and
[MATH:
P(¬L∣D) :MATH]
represent the positive and negative probabilities of gold standard gene
pairs in the given dataset,
[MATH: P(L) :MATH]
and
[MATH:
P(¬L)<
/mo> :MATH]
represent the probabilities of gold standard positive gene pairs and
negative gene pairs, respectively.
In sum, a threshold is set to filter out low signal or noise
interactions. Only when the weight value of a gene pair exceeds this
threshold, the interaction between the gene pair is considered
significant in the corresponding cell type or state. Next, these
significant interactions are compared with the reference network. If a
gene pair exists in the reference network and its weight value passes
the threshold, the interaction is considered cell type specific and
ultimately retained in the generated CSN.
Supplementary information
[203]Supplementary.^ (1MB, pdf)
Acknowledgements