Abstract
Background
Many studies have utilized computational methods, including cell
composition deconvolution (CCD), to correlate immune cell polarizations
with the survival of cancer patients, including those with
hepatocellular carcinoma (HCC). However, currently available cell
deconvolution estimated (CDE) tools do not cover the wide range of
immune cell changes that are known to influence tumor progression.
Results
A new CCD tool, HCCImm, was designed to estimate the abundance of tumor
cells and 16 immune cell types in the bulk gene expression profiles of
HCC samples. HCCImm was validated using real datasets derived from
human peripheral blood mononuclear cells (PBMCs) and HCC tissue
samples, demonstrating that HCCImm outperforms other CCD tools. We used
HCCImm to analyze the bulk RNA‐seq datasets of The Cancer Genome Atlas
(TCGA)‐liver hepatocellular carcinoma (LIHC) samples. We found that the
proportions of memory CD8^+ T cells and Tregs were negatively
associated with patient overall survival (OS). Furthermore, the
proportion of naïve CD8^+ T cells was positively associated with
patient OS. In addition, the TCGA‐LIHC samples with a high tumor
mutational burden had a significantly high abundance of nonmacrophage
leukocytes.
Conclusions
HCCImm was equipped with a new set of reference gene expression
profiles that allowed for a more robust analysis of HCC patient
expression data. The source code is provided at
[28]https://github.com/holiday01/HCCImm.
Keywords: bulk gene expression profiles, CD8 T cell, deconvolution,
hepatocellular carcinoma, immune cell polarization, therapy response
subtypes
__________________________________________________________________
The HCCImm model, a cell composition deconvolution tool, was developed
to accurately estimate the abundance of 17 cell types within the
microenvironment of hepatocellular carcinoma (HCC). The differentially
expressed genes (DEGs) were used to construct an HCC reference gene
expression signature matrix. Subsequently, predicted abundance of HCC
cells were excluded, and the proportions of immune cell types were
normalized. Survival analysis was performed to investigate the
relationship between immune cell abundance and patient prognosis. This
comprehensive workflow enables a detailed exploration of gene
expression patterns and their impact on patient outcomes. Notably, the
abundance of CD8 T cells has been identified as a significant factor
associated with prognosis and the effectiveness of drug treatments.
graphic file with name CAM4-12-15736-g012.jpg
1. INTRODUCTION
Primary liver cancer is the seventh most common cancer type and the
second most common cause of cancer‐related mortality in the world,[29]
^1 and hepatocellular carcinoma (HCC) accounts for nearly
three‐quarters of all liver cancer cases and is the dominant type of
liver cancer.[30] ^1 Treatment of advanced HCC is challenging, and
systemic chemotherapy has toxic side effects and no survival
benefits.[31] ^2 , [32]^3 Recently, immunotherapy, in particular immune
checkpoint inhibition (ICI), has been used as a clinical treatment for
HCC. By blocking the cancer cell‐mediated suppression of immune cells,
they can be reactivated and kill cancer cells. ICI has been approved as
a first‐line treatment for advanced HCC cases[33] ^4 ; however, only a
subset of patients seem to benefit from ICI therapy. It has been
hypothesized that the poor ICI response rate might be related to
individual differences in the types and relative abundances of immune
cells in the tumor microenvironment (TME).[34] ^5 In addition, Wang and
Li suggest that ICIs might not be a particularly favorable method to
treat HCC because immune checkpoint genes are not highly expressed in
LIHC patients that have a high tumor mutational burden (TMB),[35] ^6
which suggests that the characteristics of the HCC TME are different
from those of other cancers.
The TME is a complex ecosystem that consists of a range of different
types of cells; these different cell types include not only cancer
cells but also numerous interstitial cells and infiltrating immune
cells. The abundance and density of these tumor‐infiltrating immune
cells (TIICs)[36] ^7 may be associated with patient survival and cancer
treatment efficacy. TIICs seem to interact with cancer cells or other
noncancer interstitial cells in ways that are not yet fully
understood.[37] ^8 Immune cells might be able to modulate cancer
progression by regulating the growth and metastasis of tumor cells.
There are also likely to be interactions between immune cells via
cytokines or surface proteins, which creates a complex network among
immune cells.[38] ^9 , [39]^10
Immune cells are regulated by a variety of cytokines as well as by
cell–cell contacts, which results in their polarization to perform a
number of distinct functions. The immune cells present in a given
sample may have been polarized in a number of different directions, and
these complex changes impose unexpected influences on the prognosis of
cancer patients. In recent years, many studies have found that
different polarizations of the same immune cell type may have diverse
effects on cancer cells. Some polarized forms are able to inhibit the
growth of cancer cells or even kill them, while others are able to
promote the growth and/or metastasis of cancer cells.[40] ^11 For
example, macrophage cells can be polarized into classically activated
macrophages (M1), or alternatively, they can become alternatively
activated macrophages (M2). M2 cells promote the growth of cancer
cells, while M1 cells inhibit the differentiation of cancer cells.[41]
^12 A few studies have reported that a high M1/M2 ratio is associated
with greater survival of cancer patients.[42] ^13 , [43]^14
Furthermore, M2 cells can be further divided into four subtypes,
namely, M2a, M2b, M2c, and M2d, each of which has been suggested to
exert different effects within the TME. Tumor‐associated macrophages
(TAMs), which have an immunosuppressive role and protumor properties,
are referred to as M2d cells.[44] ^15 , [45]^16
In addition, CD8^+ T cells and their functional subsets have been
highlighted in many studies due to their roles in cancer
immunotherapy.[46] ^17 A higher density and a greater abundance of
CD8^+ T cells in the TME have been associated with a favorable
prognosis for cancer patients[47] ^18 ; however, in the TME of HCC
patients, resident memory CD8^+ T cells may differentiate into
exhausted phenotypes, and a higher abundance of such exhausted memory
CD8^+ T cells has been associated with the poor prognosis of HCC
patients.[48] ^19 In contrast, the activated CD8^+ T cells
differentiated from naïve CD8^+ T cells, compared to those effector
cells derived from memory CD8^+ T cells, exhibit a better resistance to
tumor‐induced immune suppression and reveal a more potent
tumor‐specific cytotoxic activity.[49] ^20
Exploring which polarized immune cells are significantly associated
with cancer patient prognosis is likely to allow new insights into the
application and development of immunotherapy. A useful approach is to
investigate the immune cell composition of publicly available tumor
sample datasets. Although experimental methods such as flow cytometry
and immunohistochemistry staining (IHC) are able to determine the
abundance of immune cell types, such information is usually unavailable
in public‐domain datasets.[50] ^21 Therefore, many cell composition
deconvolution (CCD) methods have been developed to infer the immune
cell admixture based on the bulk gene expression profile (bulk GEP)
data of tissues. These CCD methods include linear least‐square
regression,[51] ^22 , [52]^23 nonnegative matrix factorization,[53] ^24
, [54]^25 , [55]^26 quadratic programming,[56] ^27 , [57]^28 and
𝜈‐support vector regression.[58] ^29 Usually, when using a
deconvolution method, a set of cell‐specific gene expression profiles
have been curated and used as the reference (reference GEP), and this
reference GEP is then used to predict the relative abundance of
different immune cells from the bulk GEP of a given new cancer sample.
Among these computational deconvolution approaches, CIBERSORTx
developed by Newman et al. is one of the most widely used methods to
predict the immune cell composition of a bulk GEP derived from a tumor
sample.[59] ^29 , [60]^30 Moreover, marker gene‐based approaches, such
as single sample Gene Set Enrichment Analysis (ssGSEA), are able to
evaluate the expression values of the genes in a sample. These are
characteristic for each of the selected cell types and therefore
facilitate individual abundance estimations.[61] ^21
When different cell type quantification methods are applied to analyze
the bulk gene expression profiles of cancer patient tissue samples,
discrepancies in the results can occur when the predicted associations
of immune cells with HCC patient survival are explored. For example,
regulatory T cells (Tregs) are an immunosuppressive subtype of CD4^+ T
cells and are known to promote tumor growth. Tu. analyzed ICOS^+
FOXP3^+ Tregs in 57 HCC patient tissue samples by immunohistochemistry
(IHC) and reported that the infiltration of ICOS^+ FOXP3^+ Tregs showed
a negative correlation with patient survival.[62] ^31 In contrast to
this finding, Foerster et al. analyzed the TCGA‐LIHC dataset using a
marker gene‐based approach and found that a higher abundance of Tregs
was positively associated with patient survival.[63] ^32 Finally, Peng
et al. analyzed the same dataset with CIBERSORT and reported that the
abundance of Tregs was significantly higher in HCC patients with poor
prognosis.[64] ^33
These discrepancies among the results of previous studies seem to
reflect the fact that some of the approaches used have a lower level of
performance. According to benchmarks created from synthetic bulk gene
expression profile datasets, deconvolution‐based approaches usually
outperform marker gene‐based methods[65] ^21 ; however, it can be
clearly argued that further study in this area is needed. When
CIBERSORT was applied to predict immune cell compositions using cancer
patient datasets, a few limitations were noticeable. For example, when
using CIBERSORT to analyze the TCGA‐LIHC dataset, in as many as 80% of
the cases, the cell composition predictions were unable to pass the
Monte Carlo test (p‐value >0.05).[66] ^32 This finding suggests that
when using CIBERSORT, the association between cell admixture and
patient survival might be biased because only a small subset of LIHC
patients was assessed. In addition, when deconvolution‐based methods
are used, Sturm et al. noticed that nonspecific signature genes can
cause high background predictions, or “spillover,” for certain cell
types.[67] ^21 As cancer cells are highly likely to make up the largest
cell group in a tumor and thus the highest proportion in the TME, we
believe that there is a concern that, without specifically including
cancer cells in the reference GEP, a deconvolution‐based approach might
give rise to additional spillover due to the gene expression signals
contributed by the cancer cells in the sample.
Exploring the roles of infiltrating lymphocytes and other immune cells
in the TME is a popular research topic. Available evidence supports the
hypothesis that the cell heterogeneity present in the TME seems to have
an impact on cancer patient survival and response to therapy.[68] ^34 A
better understanding of the complexity present within the TME is likely
to provide further insights that will allow the development of new HCC
therapies. Based on recent studies of human HCC samples using
single‐cell RNA‐seq,[69] ^35 , [70]^36 there is high cell heterogeneity
in TME of liver cancer, containing a diversity of immune cell types,
including dendritic cells (DCs), natural killer (NK) cells, CD4^+ T
cells, CD8^+ T cells, and TAMs. Zhang et al. also found that Treg and
exhausted CD8^+ T cells were enriched in human HCC tumor tissues and
suggested that it is necessary to clarify their association with HCC
patient survival.[71] ^35
CCD methods may have limitations due to spillover from the overlapping
gene expression profiles among different cell types.[72] ^21 Various
CCD tools, including CIBERSORTx,[73] ^30 EPIC,[74] ^37 TIMER,[75] ^38
Dtangle,[76] ^39 TumorDecon,[77] ^40 and SCDC,[78] ^41 have been
developed to predict the abundance of immune cells. For example,
CIBERSORTx can predict 22 types of immune cells. EPIC can estimate the
abundance of eight cell types, including immune cells and
cancer‐associated fibroblasts. TIMER provides immune cell estimations
for diverse cancer types. Dtangle can estimate 11 types of immune cells
in Lyme disease. TumorDecon can predict 12 types of immune cells for
quantitative pharmacology models. SCDC uses an ENSEMBLE approach to
develop CCD algorithms. However, despite the improved capabilities of
these tools, spillover from the gene expression profiles of HCC cancer
cells remains a challenge for CCD methods.
Our tool, HCC Immune Estimating (HCCImm), has been explicitly designed
to analyze the immune TME in liver cancer by a unique reference gene
expression signature matrix (refGES) built from a liver cancer cell
line to improve the accuracy of cell type deconvolution. As a result,
HCCImm can more accurately distinguish between immune cells and liver
cells than other CCD methods.
Therefore, this study aims not only to create a new CCD method that can
be applied to analyze the TCGA‐LIHC dataset but also to more diversely
quantify the polarized populations of immune cells. The inferred immune
cell compositions could be associated with HCC patient survival. To
this end, we built our reference gene expression profile by including a
greater range of immune cell types. For example, we have included more
differentially polarized forms of macrophages than just the binary
categorization into M1 and M2. We also expanded the types of CD8^+ T
cells included and further divided them into two subtypes, namely,
naïve and memory CD8^+ T cells, as each has specific immune
characteristics. Cytotoxic T cells expressing the CD8 cell‐surface
marker are the most powerful effectors associated with the cancer
immune response,[79] ^42 but in the TME, tissue‐resident memory CD8^+ T
cells may express various dysfunctional markers, such as PDCD1 and
CTLA4[80] ^43 ; these markers have also been included. As a comparison,
CCD methods such as CIBERSORTx and quanTIseq can predict only a single
fraction value for the CD8^+ T‐cell type.[81] ^29 , [82]^30 , [83]^44
2. STUDY DESIGN
The workflow overview of this study is presented in Figure [84]1. To
construct the reference gene signature matrix, which was used for
regression of the immune cell types in HCC samples, we collected the
gene expression profiles of 17 distinct types of cells (Figure [85]1A).
These cell types were an HCC cell line, memory B cell, naïve B cells,
DC cells, M1 cells induced by interferon γ (IFN γ), M1 cells induced by
IFN γ, and tumor necrosis factor (TNF), M1 cells induced by
lipopolysaccharide (LPS), M2 cells induced by IL4 and dexamethasone
(DEX), M2a cells induced by IL13 or IL4, M2c cells induced by IL10, NK
cells, naïve CD8^+ T cells, memory CD8^+ T cells, naïve CD4^+ T cells,
T helper 1 cells (Th1 cells), T helper 2 cells (Th2 cells), and Tregs
(Table [86]1). The different macrophage polarization forms in the gene
expression profiles are labeled according to the corresponding
cytokines used for their in vitro induction.[87] ^45
FIGURE 1.
FIGURE 1
[88]Open in a new tab
The workflow of this study (A) The gene expression profiles of 17 cell
types were downloaded from NCBI GEO. (B) Creating refGES for 17 types
of cells. The differentially expressed genes (DEGs) within each of the
seven groups were ranked by p‐value, and then minimization of the
condition number was performed to select the top DEGs. (C) HCCImm was
built using ε‐SVR regression. After removing the predicted fraction of
HCC cells, the fractions of the remaining 16 types of immune cells were
normalized to sum 100%. (D) Survival analysis was performed to
determine whether there was an association between the predicted
abundance of immune cell types and patient prognosis.
TABLE 1.
Cell types and replicates used to build the reference gene expression
signature matrix.
Cell type Replicates Cell type Replicates
HCC_cell_lines 19 B_memory 18
M1_IFNγ 9 B_naive 25
M1_IFNγ_TNF 4 Memory_CD8_Tcell 12
M1_LPS 6 Naïve_CD8_T 15
M2_IL4_Dex 15 Naïve_CD4_T 10
M2a 12 Treg 10
M2c 8 Th1 7
DC_cell 13 Th2 9
NK_cell 12
[89]Open in a new tab
To create the refGES for CCD, one idea might be to perform ANOVA and
t‐test analysis to identify the differentially expressed genes across
the 17 cell types. However, within the 17 cell types, there are closely
related subsets, and each cell‐type subset is likely to have highly
correlated expression profiles. For example, the three types of M1
cells induced by different cytokines, as listed in Table [90]S1, have
very similar gene expression profiles, as revealed by cluster analysis.
If the cell‐specific signature genes were collected by investigating
the differentially expressed genes (DEGs) in the 17 cell types via
ANOVA, it is possible that the acquired cell signature genes might not
be sufficient to distinguish these highly correlated cell types. In
addition, highly correlated independent variables can cause
multicollinearity during linear regression. To mitigate these concerns,
we created our refGES by specifically choosing genes that are
differentially expressed between the closely related cell subtypes
(Figure [91]1B).
For the implementation of the computational code for HCCImm, we used
the linearSVR program in the Python package “scikit‐learn.” We chose
ε‐support vector regression (ε‐SVR), subject to an L1 penalty, as the
core algorithm to perform the deconvolution in this study. Our
ε‐SVR‐based deconvolution method, HCCImm, was designed in‐house to
analyze bulk expression profiles from samples with an unknown cell
composition to predict the proportions of the 16 immune cell types
present in liver cancer tissue samples.
3. IMPLEMENTATION
3.1. The refGES
The aim of this study was to quantify the 16 immune cell types
described above, as well as HCC cells, present in the TME of HCC, and,
therefore, 196 microarray samples were downloaded from NCBI GEO, and at
least four replicate samples were included for each of the 17 cell
types (Figure [92]1A). All of the samples downloaded from NCBI GEO were
analyzed using the same platform, Affymetrix U133 plus 2 (NCBI GEO
platform accession: [93]GPL570). The numbers of replicates for the
different types of cells are listed in Table [94]1, and the NCBI GEO
sample accession numbers for those microarray samples are provided in
Table [95]S1.
The microarray data were quantile normalized using the Robust
Multiarray Average (RMA) procedure offered in the R package “affy”.[96]
^46 ANOVA was performed using the R package “statistics” to identify
the genes that had significantly higher expression in certain types of
immune cells compared to that in other cell types. The expression
values of those differentially expressed genes (DEGs) were used to
build the initial matrix for the gene expression signatures of the
immune cells.
Our approach to construct the refGES consisted of three steps
(Figure [97]1B). First, in addition to the 17 cell‐type group, we
further divided the 17 cell types into six groups. Each of the six
immune cell groups consisted of the cell types that had closely
correlated gene expression profiles (Table [98]S1). For example, the M1
group consisted of M1_LPS, M1_IFNγ, and M1_IFNγ_TNF, while the M2 group
consisted of M2a, M2_IL4_DEX, and M2c.
Second, to construct the refGES, differentially expressed genes were
selected from each of the seven cell groups, including the 17‐cell‐type
group and the six groups of immune cell subsets. In each of the seven
groups, all of the genes were sorted by their p‐values in ascending
order. The top k differentially expressed genes were collected from
each group.
Third, the median expression values of these selected differentially
expressed genes from each of the 17 cell types were used to build the
initial refGES, with the dimension of approximately k genes × 7 groups
× 17 cell types × iteration #. We set k = 16, but other k values might
work equally as well. A differentially expressed gene in one of the
seven groups might be found to be differentially expressed again in one
or more of the six groups, and therefore, in such circumstances, the
number of genes to be included to build the refGES could be less than k
in each iteration for each of the seven cell type groups. Next, we used
condition numbers to determine how many differentially expressed genes
should be included in the refGES. Next, we used QIAGEN's Ingenuity^®
Pathway Analysis (IPA^®, QIAGEN) to identify the diseases and functions
associated with these genes (Supplementary Figure [99]S1).
To make the refGES more robust against input variations or noise, we
used condition numbers as a means to determine how many differentially
expressed genes should be included in the final matrix. Technically,
the condition number of a matrix is the product of the norm of the
matrix and the norm of its inverse. It is a measurement of how
sensitive a mathematical function might be to changes or errors in the
input data. In numerical analysis, a function with a low condition
number is considered to be well‐conditioned and likely to be a
relatively stable solution when there are small fluctuations in the
input data. A number of CCD methods have applied the minimization of
condition numbers to optimize the refGES.[100] ^22 , [101]^27 ,
[102]^29 Specifically, the R function “kappa” was used to estimate the
condition number of each matrix. Finally, we created the refGES matrix
that consisted of the median gene expression values of 201 genes
expressed in the 17 types of cells, including 16 immune cell types, as
well as HCC cells (Table [103]S1).
Each of the cell types listed in Table [104]1 can be further
categorized into distinct subsets. However, these subsets were not
inputted as immune cell subtypes when building our refGES. For example,
in the reference microarray dataset of memory CD8^+ T cells, three
subsets were noted, including central memory CD8^+ T cells, effector
memory CD8^+ T cells, and CD45RO^− memory CD8^+ T cells. However, when
building our refGES, all three memory CD8^+ T subsets of microarray
data were regarded as a single type of memory CD8^+ T cells. We chose
to use this kind of approach because we noticed that the gene
expression patterns of these subsets belonging to the same cell types
bear high resemblance. When using the refGES consisting of additional
immune cell types corresponding to these distinct subsets, the
regression model made poor predictions in benchmark testing using the
PBMC samples or the samples containing only pure cell types, deviating
immensely from the true proportions.
Perhaps a critical reason to prevent us from further dividing the cell
subsets to include more cell subtypes in our method is due to the
resolution limits of expression signal deconvolution. One challenge in
developing CCD methods is the high correlation of the expression
profiles of closely related cell types.[105] ^47 The gene expression
signals derived from low‐abundance cell types might be masked by the
expression of the same gene(s) in a more abundant cell subset.[106] ^9
Moreover, a remaining issue is whether the constructed refGES can be
applied to estimate the cell compositions in samples in which there are
certain subsets of immune cells that do not align perfectly with the
reference cell types in our refGES. For example, it is unclear whether
the predicted fraction of memory CD8^+ T cells reflects the true
compositions in samples containing distinct subsets of memory CD8^+ T
cells. Such subsets of memory CD8^+ T cells are referred to as
nontypical memory CD8^+ T (NT‐MCD8T) cells in the following text to
distinguish them from the reference memory CD8^+ T cells samples used
to build our refGES for the HCCImm method. Therefore, to address this
issue, at least partially, an assessment was carried out by using two
additional datasets, NCBI GEO [107]GSE85074 and [108]GSE26495, in which
there are microarray data of memory CD8^+ T cells that are either
PD1^high, PD1^low, effector, or central subtype (Table [109]S2). This
assessment revealed that the gene expression profiles of these subsets
of memory CD8^+ T cells are indeed more similar to those of memory
CD8^+ cells than to those of naïve CD8^+ T cells in our reference
datasets.
To estimate the HCC cell fraction in the HCC samples, we decided to use
the gene expression profiles of Huh7 cells to build our refGES. An
important reason is that the GEP of the Huh7 cell line, which is a
human hepatoma‐derived cancer cell line, mimics that of human
HCC‐derived metabolic genes.[110] ^48 To further assess whether Huh7
might be superior to other HCC cell lines as a representative of HCC
tumor cells, we performed a correlation analysis to determine the
correlation of Huh7 and HepG2 cell lines with human HCC samples. The
NCBI GEO accession numbers of these microarray data, including 10 Huh7,
9 HepG2, and 50 human HCC samples, are listed in Table [111]S3. The
GEPs of liver tumor samples derived from 50 HCC patients (NCBI GEO
[112]GSE45267) were more similar to those of Huh7 cells than to those
of HepG2 cells (Supplementary Figure [113]S2).
For CCD method benchmarking, it is essential to map the cell types of
the different methods to the same cell‐type ontology. For example,
CIBERSORTx is designed to estimate the fractions of 22 cell types in a
cell admixture,[114] ^29 , [115]^30 whereas quanTIseq can quantify only
10 immune cell types.[116] ^44 Since different methods elucidate cell
types in more or less detail, it is a common practice when assessing
CCD methods such as CIBERSORTx to sum up the predicted fractions for
the cell subsets belonging to the same cell type. Conversely, when CCD
methods can give only a single fraction value for one cell type but not
the detailed fractions for distinct cell subsets belonging to that one
cell type, this predicted fraction corresponds to the sum of the cell
subset fractions.[117] ^44
3.2. ε‐support vector regression
One common CCD approach is the use of linear models, where the gene
expression levels of a mixed sample (x) are modeled as a linear
combination of the gene expression levels of the individual cell types
that make up the sample. In this context, bulk GEP (b) refers to the
gene expression profile of the mixed sample, while refGES (a) refers to
the reference gene expression profiles of the individual cell types.
The linear model is used to calculate the proportions of each cell type
in the mixed sample by solving a system of linear equations.
Deconvolution can be conceived as finding the solution to the
convolution equation:
[MATH: bi=ai,<
mn mathvariant="italic">1x1+ai,<
mn mathvariant="italic">2x2…ai
,jxj :MATH]
where b [i ]is the expression level of gene i in a sample of the cell
mixture, a [i,j ]is the expression level of gene i in cell type j
(derived from the reference expression signature for this cell type),
and x [j ]is the unknown proportion of cell type j in the cell mixture.
Among the available CCD tools, CIBERSORTx uses a 𝜈‐support vector
regression (SVR)‐based approach, and this tool has outperformed other
tools in benchmarking experiments.[118] ^29 Since SVR seems to be
superior to the other regression methods in making predictions, we
chose one type of SVR, ε‐insensitive support vector regression (ε‐SVR)
which can provide a sparse solution, as the core algorithm to perform
deconvolution in this study. Our ε‐SVR‐based method is named HCCImm. To
implement HCCImm, we used the linearSVR function in the Python package
“scikit‐learn.” Unlike the 𝜈‐SVR used by CIBERSORTx, ε‐SVR does not
control the proportion of the support vectors to be used in the final
model.[119] ^49 As a result, an ε‐SVR‐based deconvolution model might
have higher flexibility when combining different predictor variables
without setting a lower bound for the support vectors. After all, it
would be arbitrary to decide a lower bound for the immune cell types of
one bulk tissue without prior knowledge of its cell composition. In
addition, the L1‐loss function was applied to minimize the mean average
error (MAE) between the predicted and real values.[120] ^50 Compared to
the L2‐loss function, the L1‐loss function makes the regression model
more robust to outlier values in the input data. The L2‐loss function
can produce much larger errors for outliers because this approach takes
into consideration the squared differences.
3.3. Other linear regression models
In addition to ε‐SVR, we tested a number of other linear regression
models, each being subject to one of the three types of regularization.
The objective function and the loss terms are presented below.
[MATH: Objective
function:min100i∑n=1ibn−bn^bn<
/mfrac>+regularized term,where :MATH]
[MATH: b^=ai,<
mn>1x1+ai,<
mn>2x2…ai,j
xj :MATH]
[MATH: L1loss:α∑x :MATH]
[MATH: L2loss:α∑x2 :MATH]
[MATH: ElasticNetloss:α∑x+β∑x2
mn> :MATH]
[MATH: Subject
tox≥0
:MATH]
3.4. Benchmarking of cell deconvolution methods
3.4.1. Analysis of pure cell types
The benchmarking test to assess the performance of HCCImm initially
used pure cell types and followed a leave‐one‐out strategy. In each
test run, one of the 196 microarray samples obtained from NCBI GEO to
build the refGES for the seventeen cell types was used as the test
dataset for deconvolution, while the remaining 203 samples were used to
build the refGES for testing using the aforementioned ANOVA‐based
procedure. Bar plots were produced to show the distributions of the
predicted cell‐type fractions in those pure‐cell samples. The results
were compared with the predictions made by CIBERSORTx and quanTIseq, as
well as the least absolute shrinkage and selection operator (LASSO),
L1‐, L2‐, and elastic net‐regularized linear regressions.
3.4.2. Analysis of memory CD8 ^+ T‐cell sample subsets by HCCImm
Memory CD8^+ T cells have many subsets, including central memory CD8^+
T cells, effector memory CD8^+ T cells, and others, such as PD1^+ CD8^+
T cells. To confirm which memory CD8^+ T‐cell subsets HCCImm could
estimate, two additional CD8^+ T‐cell datasets, [121]GSE85074 and
[122]GSE26495, were assessed using the model.
3.4.3. Analysis of human PBMCs by flow cytometry
The performance of HCCImm was also assessed by using real mixtures of
gene expression profiles derived from PBMCs. First, a set of microarray
samples of human PBMCs, [123]GSE107990, was downloaded from NCBI GEO;
this dataset has 164 samples. Since each of these PBMC samples had its
immune cell composition determined by flow cytometry, we compared the
cell abundance predicted by HCCImm with the experimentally determined
composition. The platform used to obtain those microarray samples was
the Illumina HumanHT‐12 V4.0 expression beadchip and the raw dataset
was quantile‐normalized using the R package “preprocessCore.” We
followed a simple cross‐platform quantile normalization approach that
is described in a previous study.[124] ^51 Mean absolute errors (MAEs)
and Pearson's correlation coefficients were calculated to compare the
performances of HCCImm against those of other CCD tools.
Next, to assess whether HCCImm can be applied to analyze the cell
composition of RNA‐seq datasets, such as TCGA cancer samples, a set of
nine RNA‐seq samples derived from the human PBMC dataset [125]GSE107572
was downloaded from NCBI GEO; these samples had their composition
confirmed by flow cytometry for eight types of immune cells.[126] ^44
To perform this benchmark test, the raw counts of the RNA‐seq data were
transformed by transcripts‐per‐millions (TPM) normalization. Following
this, the normalized values were further adjusted to make the RNA‐seq
data have a distribution with a maximum value close to that of the
microarray datasets used to build our refGES. When using CIBERSORTx and
quanTIseq, gene expression values in TPM were used, as has been
suggested by two papers.[127] ^44 , [128]^52 The cell type mapping from
the predictions made by CIBERSORTx, quanTIseq, and HCCImm is provided
in Table [129]S4(A).
A common approach for evaluating the performance of CCD tools involves
mapping the predicted fractions of immune cell subtypes to higher‐level
immune cell types.[130] ^21 , [131]^34 This strategy typically starts
by comparing the predicted fractions of immune cell subtypes with true
experimentally obtained fractions. In cases where the true fraction is
only available for a higher‐level immune cell type, but not for its
constituent subtypes, the predicted fractions of the subtypes are
usually aggregated to estimate the fraction of the higher‐level immune
cell type.
As an example, HCCImm can predict the fractions of two subtypes of
CD8^+ T cells: naïve CD8^+ T cells and memory CD8^+ T cells. In the
evaluation using the [132]GSE107990 dataset, the predicted fractions
for these subtypes were summed to obtain an aggregated fraction value
for the overall CD8^+ T‐cell population (Table [133]S4A).
Other CCD tools have been assessed in a similar manner, such as
CIBERSORT, CIBERSORTx, and quanTIseq.
3.4.4. Analysis of the simulated bulk gene expression profiles of liver
cancer tissue samples
To evaluate the ability of HCCImm to accurately estimate the cell
composition of the highly cancerous HCC TME, we conducted in silico
simulations by spiking the expression signals of 16 immune cell types
into the microarray data of an HCC cell line. We simulated bulk gene
expression profiles of liver cancer tissue by randomly selecting one
HCC cell sample from five datasets (GEO [134]GSE68927, [135]GSE112788,
[136]GSE128517, [137]GSE75024, or [138]GSE78736). Then the expression
signals for the 16 immune cell types were randomly sampled from the 185
immune cell microarray samples, as shown in Table [139]S1.
[MATH: AsimulatedHCCgene expression
profile=PHCC*HCCGEP+∑Pj
*GEPj,wherePHCC+∑Pj
=1
:MATH]
j represents the immune cell type. Each simulated profile was generated
by summing the GEPs of 17 cell types and weighting them according to
their respective proportions which were randomly assigned.
Specifically, we generated 100 simulated HCC tissue profiles for each
of the six different cancer‐cell fractions (P [HCC ]) ‐ 15%, 30%, 45%,
60%, 75%, and 90%. Therefore, in each set of 100 simulated profiles,
each profile had a predetermined P [HCC ]fraction of gene expression
signals derived from a randomly selected HCC cell microarray sample,
and the remaining fraction (100% ‐ P [HCC ]) was composed of expression
signals of 16 immune cell microarray samples which were randomly
selected and weighted.
In each test run, deconvolution was performed by taking the remaining
microarray samples of HCC cells and immune cells (excluding the one HCC
cell sample and the 16 immune cell samples used to simulate this bulk
profile with spike‐ins) to build the refGES that was to be used for
testing. When assessing the predictions, a normalization was performed
such that the fraction making up the HCC cell line was discarded
(Figure [140]1C), and the fractions of the 16 types of immune cells
were summed to 100%.
To investigate the agreement of the predictions made by the various CCD
methods, including ours, Pearson's correlation coefficient and absolute
error (AE) were calculated. Boxplots were produced to show the
distributions of AEs, and the results were compared with the AEs of the
predictions made by other methods. The cell type mapping to the ground
truths from the predictions made by CIBERSORTx, quanTIseq, and HCCImm
is provided in Table [141]S4(B,C).
3.5. Analysis of TCGA‐LIHC samples
To predict the cell composition of HCC samples and to perform survival
analysis of the patients, 371 RNA‐seq samples from the TCGA‐LIHC
dataset[142] ^53 were downloaded from the NCI Genomic Data Commons (NCI
GDC, [143]https://gdc.cancer.gov/) (Figure [144]1C,D). In addition, the
tumor mutation data of the TCGA‐LIHC samples were obtained from the MC3
project
([145]https://gdc.cancer.gov/about‐data/publications/mc3‐2017).[146]
^54 Finally, the drug response data of the TCGA‐LIHC patients, curated
by Ding et al., were downloaded from
[147]http://lifeome.net/supp/drug_response/.[148] ^55
4. RESULTS
4.1. Building the refGES
Using the approaches mentioned in the Methods section, including
prioritizing the DEGs in the cell subsets, and using the condition
number calculation, a set of 201 DEGs was selected to build our refGES
(Figure [149]1B, Supplementary Figure [150]S3). Thus, the dimension of
the final refGES matrix used in this study was 201 genes × 17 cell
types. The refGES matrix is provided in Table [151]S5.
These 201 genes as a whole are described as the signature genes (SGs)
in the following text. To reveal the clustering of the 196 microarray
samples, these SGs were used to generate the t‐SNE (t‐distributed
stochastic neighbor embedding) plot,[152] ^56 and from this plot, it is
clear that the samples form a number of distinct clusters, each
corresponding to a single cell type (Figure [153]2). In contrast, when
using the t‐SNE plot generated with all of the ~20,000 gene expression
values as the sample features, some clusters consisted of samples that
belong to different cell types (Supplementary Figure [154]S4). For
example, clusters made up of heterogeneous cell types were found for
the DC samples, the M2c samples, the M1_LPS samples, and the M2a
samples. Furthermore, memory CD8^+ T‐cell samples and naïve CD8^+
T‐cell samples overlapped. Finally, the NK, Th1, Th2, and naïve CD4^+
T‐cell samples did not form clear‐cut clusters (Supplementary
Figure [155]S4). This result suggests that the expression values of
these SGs can be used to distinguish the 17 types of cells that make up
the samples.
FIGURE 2.
FIGURE 2
[156]Open in a new tab
The t‐SNE plot of the 196 microarray samples using the cell type
signature genes (SG) as the sample features. Each data point represents
one sample and is color‐coded according to its cell type.
We performed pathway enrichment analysis to explore the functions of
the SGs and found that half of them were implicated in four types of
functions/diseases involving the liver, namely, liver damage, necrosis
of the liver, inflammation of the liver, and liver lesions
(Supplementary Figure [157]S1). In contrast, the genes in the
CIBERSORTx LM22 are only associated with one type of liver disease.
4.2. Concordance between the predicted cell composition and the true
proportions
4.2.1. Pure‐cell samples
HCCImm was then assessed by gene expression profile analysis against
the known proportions of immune cells. First, each of the 196
microarray samples was tested using the leave‐one‐out strategy as
mentioned in the Materials and Methods. Since each microarray sample
was derived from only a single type of immune cell, each of these
microarray samples was regarded as a pure‐cell sample, and the true
cell composition for each of the pure‐cell samples was 100% for its
specific type of cell.
In this benchmark test, the prediction made by HCCImm was able to
identify the composition of the major cell type in the pure‐cell
samples. The predictions made by HCCImm were consistent with the
original data for the seventeen types of cells. HCCImm, as well as the
regularized regression approaches using exactly the same refGES built
in this study, consistently predicted the dominant cell type with a
median fraction of ~80% and short error bars, with the predicted
fractions for other cell types being generally lower than 10%
(Figure [158]3). In contrast, the median values for the cell fractions
predicted by the two most widely used tools, namely, quanTIseq and
CIBERSORTx, were lower than 60%, and there were larger variances.
FIGURE 3.
FIGURE 3
[159]Open in a new tab
The bar plots for the benchmark using pure‐cell samples. Each graph
corresponds to the predicted fraction of the specific cell type of one
of the 196 pure‐cell samples. The performance of HCCImm, L1‐ and
L2‐regularized linear regression, LASSO, quanTIseq, and CIBERSORTx are
shown.
4.2.2. Memory CD8 ^+ T samples
To evaluate whether our refGES is suitable for performing cell
deconvolution analysis against the GEPs containing heterogeneous
subsets of memory CD8^+ T cells, an independent test was performed by
using two additional NCBI GEO datasets that were not used to build our
refGES. These two datasets consist of a series of GEPs, where each
corresponds to a distinct subset of memory CD8^+ T cells (NT‐MCD8T, see
Materials and Methods, 3.1). Therefore, this assessment can also be
conceived as an extended test using GEPs consisting of pure‐cell
samples.
First, by carrying out correlation analysis, we noticed that the GEPs
of these NT‐MCD8T data were more similar to those of the memory CD8^+ T
cells (MCD8T) used to build our refGES than to those of the naïve CD8^+
T cells (Supplementary Figure [160]S5). Next, we predicted the cell
compositions of these NT‐MCD8T data using a linear regression approach,
LASSO. We found that LASSO regression consistently predicted the
dominant cell type as MCD8T, with a high median fraction of ~74% and a
narrow interquartile range (IQR), and the sum of the predicted
fractions for other cell types was generally lower than ~34%
(Supplementary Figure [161]S6). The LASSO regression result suggests
that even though some subsets of memory CD8^+ T cells may not perfectly
match the features of MCD8T cells used in the refGES, our refGES is
suitable for analyzing GEPs that contain such memory CD8^+ T‐cell
subsets.
We used two sets of DEGs for further analysis. One DEG set was prepared
by comparing the GEPs of memory CD8^+ T cells and naïve CD8^+ T cells.
The other DEG set was prepared by comparing the GEPs of PD1‐high CD8^+
T cells (for the GEO accession, see Table [162]S2) and naïve CD8^+ T
cells. The significantly enriched pathways in both DEG sets are
involved in the regulation of exhausted CD8^+ T cells (Supplementary
Figure [163]S7), including calcium signaling,[164] ^57 mitochondrial
dysfunction,[165] ^58 the sirtuin signaling pathway[166] ^59 and
glucocorticoid receptor signaling.[167] ^60
4.2.3. PBMC microarray and RNA‐seq samples
Unlike the previous evaluation of our method, where only pure‐cell
samples were used, the next set of benchmark testing was based on real
human samples consisting of multiple cell types, namely, human PBMCs
with linked flow cytometry results. The purpose of this test was to
evaluate whether HCCImm still performs well when the samples are
derived from real human tissues. Importantly, however, in these
samples, the cell types that had been quantified were not completely
consistent with those that could be predicted by HCCImm and by the
other CCD tools. For example, in the [168]GSE107990 dataset, only the
fraction of the precursor of macrophages, namely, monocytes, had been
determined, but not the different polarization forms of macrophages.
Another example is that only CD4^+ T cells were experimentally
determined for this PBMC microarray dataset, whereas HCCImm is able to
predict the fractions of five subtypes of CD4^+ T cells, including
naïve CD4^+ T, Treg, Th1, Th2, and memory CD4^+ T cells. To compare the
predicted fractions with the original differential cell type data
during this benchmark, the fractions predicted by HCCImm for various
polarization forms of macrophage M1 and M2 cells were summed to give an
estimate of the total monocyte fraction in the samples. The full table
mapping the predicted cell types against the experimental cell type
dataset is provided in Table [169]S4(A).
The results show that HCCImm was superior to the other regularized
linear regression approaches tested in this study, as well as publicly
available tools such as quanTIseq and CIBERSORTx. Specifically, our
approach has a significantly higher median value for the correlation
coefficients (~0.86) between the predictions and the experimental PBMC
microarray datasets (Figure [170]4).
FIGURE 4.
FIGURE 4
[171]Open in a new tab
Boxplots of the correlation coefficients of the predictions against the
human PBMC microarray sample dataset [172]GSE107990. Elastic, L1, and
L2 correspond to the linear regression methods subject to elastic net,
L1‐regularization, and L2‐regularization, respectively. (A) Boxplots of
the Pearson's correlation coefficients for HCCImm, L1‐, L2‐, and
elastic net‐regularized methods. (B) Boxplots of Pearson's correlation
coefficients for HCCImm, CIBERSORTx, and quanTIseq.
However, when the nine RNA‐seq samples from human PBMCs with known cell
fractions were used for benchmarking, it could be seen that the
predictions made by quanTIseq, a method originally designed to analyze
RNA‐seq data, had a high correlation with the experimental cell
fractions. In this benchmark test, the performance of HCCImm was better
than that of quanTIseq, with the median values of the correlation
coefficients being ~0.88 and ~ 0.85, respectively, whereas higher
variances in the correlation coefficients can be seen for the
predictions made by CIBERSORTx. The predictions made by HCCImm have
lower mean square errors (MSEs) than those of the other two methods
(Figure [173]5A). The median error of the predictions made by HCCImm
was closer to zero (Figure [174]5B). The assessment suggests that both
quanTIseq and our ε‐SVR‐based method, HCCImm, outperform CIBERSORTx
when real human RNA‐seq samples are used. When using the expression
signals derived from multiple cell types, the quanTIseq approach could
quantify only ten types of immune cells. Although the refGES used in
this study was originally built to utilize microarray samples, this
benchmark test supports that HCCImm can also be applied to analyze
RNA‐seq samples.
FIGURE 5.
FIGURE 5
[175]Open in a new tab
(A) Corrplot of Pearson's correlation coefficients and (B) boxplots of
the prediction errors using the experimental cell fractions from the
human PBMC RNA‐seq dataset.
4.3. Simulated bulk tissues with different fractions of HCC cells
Next, an association analysis between the immune cell compositions and
the survival of HCC patients, as well as other cancer features, was
performed. This part of the study was designed to take advantage of the
gene expression data of the TCGA‐LIHC dataset along with the available
clinical data, including survival time. A benchmark test to reveal the
performance of HCCImm when applied to the bulk expression profiles of
HCC samples was developed. However, bulk GEPs are seldom accompanied by
experimentally determined cell fractions. Therefore, we simulated the
bulk expression profiles of the HCC samples to assess our CCD method.
Since the dominant cell type in the TME is most likely tumor cells,
they should make up a high proportion of the bulk GEP. Therefore, we
created synthetic datasets that contained a fixed percentage (x%) of
gene expression signals derived from HCC cells, and the remaining
fraction (100% ‐ x%) was composed of signals from the sixteen types of
immune cells that were randomly sampled and weighted. The results
revealed that HCCImm, our ε‐SVR‐based method, was quite robust, showing
consistently high correlations between the predicted cell fractions and
predefined fractions of tumor cells (Figure [176]6). Although the
predictions made by quanTIseq had higher correlations with original
data, when the fractions of HCC cells in the simulated datasets were
higher than 75% (Figure [177]6), the predictions made by HCCImm had
significantly smaller AEs than those of the predictions made by
quanTIseq and CIBERSORTx (Figure [178]7A–F).
FIGURE 6.
FIGURE 6
[179]Open in a new tab
Boxplots of Pearson's correlation coefficients of the stimulated
dataset predictions compared to the original data. The plots were
generated using the simulated bulk HCC samples for each benchmark based
on the gene expression signals with a fixed fraction of HCC cells,
namely, 15%, 30%, 45%, 60%, 75%, and 90%. Each data point in each
boxplot represents a simulated HCC sample.
FIGURE 7.
FIGURE 7
[180]Open in a new tab
Boxplots of the predicted cell fraction absolute errors (AEs). A–F: The
boxplots correspond to each benchmark based on the gene expression
signals with a fixed HCC cell fraction, namely, 15%, 30%, 45%, 60%,
75%, and 90%, in the simulated bulk tissues. G,H: The boxplots
correspond to the benchmark tests using the microarray and RNA‐seq
samples of PBMCs, respectively.
Furthermore, to evaluate the effect of refGES on the cell deconvolution
model, we utilized our refGES instead of the default LM22 matrix to
carry out CIBERSORTx analysis, and this test was labeled
CIBERSORTx.ourref. Based on the correlation between the
CIBERSORTx.ourref result and the original data of the simulated
datasets, we determined that its performance was better than that of
CIBERSORTx with default LM22, albeit still not as robust as HCCImm.
However, for the previous benchmark using pure‐type samples, the
performances of L1‐regularized, L2‐regularized, and elastic
net‐regularized linear regressions were slightly better than that of
HCCImm (Figure [181]3). This benchmark using simulated HCC samples
supports the hypothesis that our ε‐SVR‐based method, HCCImm, together
with the new refGES, is likely to be more suitable for analyzing HCC
bulk tissue samples that contain multiple cell types.
4.4. Association of immune cell abundance with HCC patient survival and TMB
The ultimate goal of this study was to explore whether there is an
association between immune cell abundance and the clinical features of
HCC patients, such as survival time and TMB. Among the TCGA‐LIHC
patients, 369 patients had complete survival information, and these
patients were extracted for cell fraction prediction
(Figure [182]1C,D). The raw counts of the RNA‐seq samples of these
patients were converted to TPM values and transformed as mentioned in
the “Methods” section.
The strategy used for the association analysis between the immune cell
fractions and patient survival is as follows. First, the cell fractions
of the 369 TCGA‐LIHC RNA‐seq samples were predicted. The patients were
then categorized into four groups based on their risk histories
(Figure [183]8). The association of the cell fractions with the OS of
the HCC patients was then assessed by the log‐rank test and univariate
Cox regression analyses (Figure [184]1D). To consider the potential
bias when performing cross‐sample comparisons of the abundance of each
immune cell type, we decided that the predicted HCC cell fractions
should be removed and the remaining fractions of the sixteen types of
immune cells in each sample were then normalized to 100%
(Figure [185]1C). We argue that the potential variations in the
resection procedure during tumor sample acquisition may have affected
the RNA‐seq results, such that the proportion of gene expression
signals derived from HCC cells across different samples. For the
respective association analysis between immune cell abundance and
patient prognosis, the Kaplan–Meier method was used to create survival
curves. For each immune cell type, the 369 HCC patients were divided
into high and low groups. The high group represented the patients with
a predicted fraction of one immune cell type higher than the median
value, and vice versa. By using this approach, it was found that the
abundance of three immune cell types was significantly associated with
HCC patient OS (Figure [186]9). HCC patients with higher fractions of
memory CD8^+ T cells and Treg cells had a worse prognosis (log‐rank
test p‐values of 0.055 and 0.037, respectively). In contrast, HCC
patients with a higher fraction of naïve CD8^+ T cells were found to
have a significantly better prognosis (log‐rank test p‐value: 0.013).
In addition, for subgroups of HCC patients carrying risk factors,
including HCV and alcohol consumption, a higher fraction of Tregs and a
higher fraction of memory CD8^+ T cells were associated with a worse
prognosis (Supplementary Figure [187]S8).
FIGURE 8.
FIGURE 8
[188]Open in a new tab
TCGA‐LIHC patients were categorized into four risk groups.
FIGURE 9.
FIGURE 9
[189]Open in a new tab
Kaplan–Meier survival curves of HCC patients were created by dividing
patients into high and low groups corresponding to high and low
abundances of naïve CD8^+ T cells, Treg cells, and memory CD8^+ T
cells, A, B, and C, respectively. The red lines indicate the high
group, while the green lines indicate the low group.
The Kaplan–Meier analysis indicated that the higher abundance group of
naïve CD8^+ T cells was a significant factor for the survival of HCC
patients (Figure [190]9). Two subtypes of CD8^+ T cells are associated
with HCC patient prognosis. Patients were divided into four groups
based on their proportions of naïve and memory CD8^+ T cells: higher
proportions of naïve CD8^+ T cells and higher proportions of memory
CD8^+ T cells (naïve_h&mem_h), higher proportions of naïve CD8^+ T
cells and lower proportions of memory CD8^+ T cells (naïve_h&mem_l),
lower proportions of naïve CD8^+ T cells and higher proportions of
memory CD8^+ T cells (naïve_l&mem_h), and lower proportions of naïve
CD8^+ T cells and lower proportions of memory CD8^+ T cells
(naïve_l&mem_l). The Kaplan–Meier plot analysis revealed the
association between these CD8^+ T‐cell subtypes and patient prognosis.
The results showed that patients with a higher proportion of naïve
CD8^+ T cells had a better prognosis, regardless of the memory CD8^+ T
cell proportion. This finding suggests that the proportion of naïve
CD8^+ T cells could serve as a valuable prognostic marker for HCC
patients (Supplementary Figure [191]S9).
To further investigate which immune cell types might be critical to HCC
patient survival, univariate Cox regression analysis (Cox, 1972) was
performed to assess whether there was a correlation between immune cell
fractions and HCC patient OS. The hazard ratios (HRs) and 95%
confidence intervals (CIs) of the sixteen immune cell types are shown
in Figure [192]10. These results revealed that two types of immune
cells, M2a cells, and naïve CD8^+ T cells, seem to have a positive
impact on HCC patient OS. In contrast, memory CD8^+ T cells seem to
have a negative impact on HCC patient OS (p‐value = 0.07). Furthermore,
higher abundances of Treg cells and naïve B cells were found to be
significantly associated with a worse OS (Treg: HR = 1.07, 95%
CI = 1.02–1.12, p‐value = 0.01) (B_naïve: HR = 1.16, 95%
CI = 1.00–1.35, p‐value = 0.05).
FIGURE 10.
FIGURE 10
[193]Open in a new tab
Cox regression analysis between immune cell abundance and the OS of 369
HCC patients using the proportions of sixteen immune cell types as the
independent variable.
Moreover, progression‐free survival (PFS), in addition to OS, can also
provide valuable information, which shows how long a patient's cancer
remains stable or does not worsen after the initial diagnosis.[194] ^61
The results revealed that in the TCGA‐LIHC patients, the proportions of
naïve CD8^+ T cells and Tregs were significantly associated with PFS.
Naïve CD8^+ T cells seem to have a positive impact on HCC patient PFS,
whereas Treg cells are likely to have a negative impact on HCC patient
PFS, suggesting that these immune cell subsets may play a role in
cancer progression and the response to treatment (Supplementary
Figure [195]S10).
We further categorized the HCC patients into distinct risk groups based
on factors such as chronic hepatitis and alcohol consumption. Each risk
group is likely to represent a distinct type of underlying condition
related to chronic inflammation. For example, hepatitis B virus (HBV)
carriers often suffer from chronic inflammation of the liver due to
recurrent reactivation of their HBV infection,[196] ^62 and
HBV‐specific CD8^+ T cells then infiltrate into their liver and acquire
a specific phenotype.[197] ^63 Moreover, alcohol consumption might lead
to a proinflammatory effect involving Kupffer cells, which has been
shown to reduce the recruitment of CD8^+ T cells.[198] ^64 A number of
associations between immune cell abundance and the various risk groups
have been identified. For example, the HR of Treg cells in the OS of
HBV‐HCC patients is 1.28, which is higher than that of all other HCC
patient groups. Interestingly, unlike the survival of HBV‐HCC patients,
the abundance of Treg cells does not appear to be significantly
associated with the survival of HCC patients in the other risk groups
(Figure [199]11). We noticed that in addition to the HBV group, the
survival of the hepatitis C virus (HCV) group might be associated with
the abundance of Tregs. A Cox hazard analysis for PFS in the HCV‐HCC
group showed a HR = 1.12 and p‐value <0.05 for Treg abundance
(Supplementary Figure [200]S10). We reviewed the literature and noticed
that this association between Treg cell abundance and viral risk group
survival might reflect the immunosuppressive role of Treg cells in the
progression of chronic viral hepatitis, which might not occur in
non‐viral HCC risk groups. In hepatitis viral infections, Treg cells
play a protective role in the host by suppressing immune‐mediated
mechanisms of liver damage.[201] ^65 Another study also provided
evidence that HBV infection‐related factors could lead to the expansion
of Tregs and enhance their suppressor function, which in turn could
dampen the antitumor immune response to HCC tumor antigens and hinder
the immune surveillance of HCC tumors.[202] ^66
FIGURE 11.
FIGURE 11
[203]Open in a new tab
Cox regression analysis of immune cell abundance in HBV‐HCC patients
using the fraction of 16 immune cell types as the independent variable.
Another association was found between the M2 polarization subtypes, M2a
and M2_IL4_DEX, and the OS of HCC patients with HBV and alcohol
consumption risk factors. The HBV‐HCC patients and the alcohol‐HCC
patients who had higher fractions of M2a (Figure [204]11) and
M2_IL4_DEX (Figure [205]12), respectively, were found to have HRs
significantly lower than 1.0. In contrast, the immune cell types
associated with increased HRs among alcohol‐HCC patients are different
from those associated with increased HRs among HBV‐HCC patients.
Alcohol‐HCC patients with a higher abundance of memory CD8^+ T cells
and naïve CD4^+ T cells have a worse prognosis (Figure [206]12).
However, HCV‐HCC and HCC patients without other risk factors did not
have a significant hazed ratio (Supplementary Figures [207]S11 and
S12).
FIGURE 12.
FIGURE 12
[208]Open in a new tab
Cox regression analysis of the immune cell abundance of alcohol‐HCC
patients using the fraction of 16 immune cell types as the independent
variable. Note: *p < 0.05.
4.5. Association between immune cell fraction and TMB
Cancer immunotherapy has changed the treatment landscape for cancer
patients. However, in terms of the treatment of HCC, only a small
subset of patients respond well to immunotherapy.[209] ^67 Therefore, a
better understanding of the interaction between HCC cells and immune
cells in the TME should provide insights into possible prognostic and
predictive biomarkers that can then be used to select patients who
might benefit from ICI‐based treatment.[210] ^67 A high TMB has been
proposed as a biomarker to predict the response to immune checkpoint
blockade (ICB) based on the hypothesis that increased tumor mutations
lead to the production of more antigenic peptides, which enhances
immunogenicity.[211] ^68 In contrast, available evidence suggests that
an association between TMB and immune signatures is dependent on the
cancer type, and a high TMB has been associated with a poor prognosis
among TCGA‐LIHC patients (Supplementary Figure [212]S13).[213] ^6
Therefore, we investigated whether the TMB level of HCC patients was
correlated with the levels of certain immune cell types that might be
associated with immunosuppression in the HCC TME.
We used HCCImm to investigate the association between TMB and the
abundance of leukocytes in TCGA‐LIHC samples. The TCGA mutation calls
were acquired from the MC3 calls in the Pancancer Atlas
([214]https://gdc.cancer.gov/about‐data/publications/pancanatlas).[215]
^54 We noticed that when the LIHC samples had a high TMB (TMB‐H), there
were significantly lower fractions of the M2 cell subtypes (t‐test,
p‐value = 0.017), and this was linked to a significantly higher sum for
all of the fractions of the other leukocytes, specifically excluding
macrophages (t‐test, p‐value = 0.001). An association of TMB‐H with M2
cell levels and other tumor‐infiltrating lymphocyte (TIL) cell levels
was also observed in the HBV‐HCC patient subgroup. In the TMB‐H group,
the subset of patients with a higher fraction of Treg cells had a worse
prognosis (log‐rank test p‐value: 0.039) (Figure [216]13A), whereas the
subset of patients with a higher fraction of naïve CD8^+ T cells had a
significantly better prognosis (log‐rank test p‐value: 0.044)
(Figure [217]13B).
FIGURE 13.
FIGURE 13
[218]Open in a new tab
Kaplan–Meier survival curves of HCC patients with high TMB were
obtained by dividing the patients into two groups based on their level
of (A) Treg cells and (B) naïve CD8^+ T cells.
Moreover, to investigate the potential impact of Treg abundance on the
relationship between HCC patient OS and their TMB levels, we generated
waterfall plots separately for high‐ and low‐Treg groups (Supplementary
Figure [219]S14). The y‐axis of the plots represents the estimated
survival scale, which was calculated using a min‐max scaling method
based on patients' survival time. These plots revealed that, within the
first 2 years after diagnosis, HCC patients with high TMB and Treg‐high
status experienced a significantly shorter overall survival time
(survival scale y < 0.2, Fisher's exact test, p‐value of 0.017,
Supplementary Figure [220]S14A). In contrast, in Treg‐low status, the
result shows that the TMB levels were not associated with survival
(Supplementary Figure [221]S14B). This finding provides further support
for the notion that Treg abundance may play a role in the progression
of HCC in patients with high TMB levels.
4.6. Immune cell levels and anticancer drug response among TCGA‐LIHC patients
We explored the association between immune cell abundance and patient
response to treatment. Although data on the responses to ICB therapy
are not available for TCGA‐LIHC patients, a few records about the
responses to cancer therapy have been documented in TCGA clinical data.
Previously, Ding et al. curated drug treatment information of TCGA
patients (Supplementary Figure [222]S15), wherein 180 clinical response
records related to four chemotherapeutic drugs were obtained to
determine the molecular features that can predict outcomes.[223] ^54
Here, we took advantage of these records to assess the predictive
utility of immune cell levels with respect to clinical drug responses
among TCGA‐LIHC patients. Interestingly, all four of the LIHC patients
that showed complete or partial responses to treatment had higher
levels of DCs and naïve CD8^+ T cells (Fisher's exact test,
p‐value = 0.057) (Table [224]2A). When we focused only on the HCC
patient subgroup that had a high level of memory CD8^+ T cells or
Tregs, a high abundance of naïve CD8^+ T cells was significantly
associated with a greater treatment response (Fisher's exact test,
p‐value = 0.042) (Table [225]2B).
TABLE 2.
The 2 × 2 tables for TCGA‐LIHC patients with drug response records have
high and low levels of naïve CD8^+ T cells. (A) All LIHC patients. (B)
The HCC patient subgroup with a high level of memory CD8^+ T cells or
Tregs.
A
Responder No treatment records
High naïve CD8^+ T 4 169
Low naïve CD8^+ T 0 180
B
Responder No treatment records
High naïve CD8^+ T 3 70
Low naïve CD8^+ T 0 135
[226]Open in a new tab
5. DISCUSSION
5.1. Bulk RNA‐seq and differential polarizations
Our ε‐SVR‐based method, HCCImm, was designed to use the bulk GEPs of
HCC samples to predict the abundance of different major immune cell
types such as B cells, CD8^+ T cells, CD4^+ T cells, and macrophages.
The benchmark testing revealed that HCCImm had at least a comparable
performance to other immune cell quantification methods when used to
analyze bulk RNA‐seq samples. Furthermore, HCCImm was also able to make
predictions for more cell types than earlier methods, such as
quanTIseq, which was specifically designed to analyze RNA‐seq data. In
addition, when analyzing the TCGA‐LIHC dataset, for >95% (355/369) of
the cases, our predictions for cell composition passed the Monte Carlo
test (p‐value <0.05). Therefore, HCCImm was able to predict the
fractions of immune cell subtypes, which were then correlated with HCC
patient OS.
5.2. Correlation between OS, treatment response, and immune cell composition
One important issue concerns the association between TAMs and HCC
patient OS. TAMs with the M2 phenotype have been proposed to be key
players in cancer‐related inflammation and are involved in promoting
the survival, growth, and metastasis of cancer cells, as well as the
suppression of adaptive immunity.[227] ^69 However, in previous
computational studies investigating immune cell composition, there have
been no consistent findings with respect to an association between the
level of macrophage polarization and HCC patient OS. For example,
Manoharan et al. reported that a high level of monocytes, but not
macrophages, was negatively associated with TCGA‐LIHC patient OS.[228]
^70 Foerster et al. stated that the absence of macrophages and Th2
cells were positively correlated with TCGA‐LIHC patient OS. Peng et al.
found that TCGA‐LIHC patients in the poor prognosis group had higher
levels of M0 and M2 cells.[229] ^33
Using HCCImm, we have been able to predict the cell fractions in the
TCGA‐LIHC samples; this analysis included multiple polarized forms of
macrophages, not just the binary M1 and M2 cell categories. After the
analysis was carried out, we found no significant association with
patient OS. Hourani et al.[230] ^71 reported that the subtypes of M2
macrophages that are associated with poor cancer prognosis differ
across various cancer types. In liver cancer, M2 cells have been
commonly associated with poor prognosis. However, it remains unclear
how the subtypes of M2 cells specifically impact different types of
liver cancer. When we further investigated the subgroups of HCC
patients with distinct risk histories, the HBV‐HCC patients and
alcohol‐HCC patients with higher levels of M2a (Figure [231]11) and
M2_IL4_DEX (Figure [232]12), respectively, were found to have HRs
significantly lower than 1.0. This result suggests that not all subsets
of M2 cells are associated with a poor HCC patient prognosis.
This result seems to contradict the common viewpoint that TAMs with an
M2 phenotype are associated with a poor prognosis among HCC patients.
Perhaps because we have been unable to find a negative association of
any macrophage subtype with HCC patient OS, the results might reflect
the possibility that the selected macrophage subtypes in this study do
not correspond well to the survival‐associated subsets of macrophages.
Since TAMs are composed of heterogeneous subsets of functionally
distinct macrophages, perhaps only a subset of TAMs might have an
important association with patient survival, which is supported by
recently published research using scRNA‐seq to investigate immune cells
in the HCC TME.[233] ^72 Nonetheless, our results also indicate that
some subsets of macrophages may improve the prognosis of several HCC
patient subgroups. Notably, several meta‐analyses have also suggested
that TAMs are not necessarily associated with poor cancer patient
prognosis.[234] ^73 , [235]^74 , [236]^75
CD8^+ T cells are another type of immune cell that is critical for
tumor control and is known to be associated with a favorable HCC
prognosis.[237] ^72 , [238]^76 Our analyses using the log‐rank test and
Cox univariate regression revealed that HCC patients with a higher
fraction of naïve CD8^+ T cells had a favorable prognosis. In contrast,
the abundance of another CD8^+ T‐cell subset, memory CD8^+ T cells,
seems to be associated with a HR significantly larger than 1.0
(Figure [239]10). This result suggests that in the HCC TME, there is a
positive association between memory CD8^+ T cells and patient OS, which
is similar to that found for Treg immunosuppressive cells
(Figure [240]10). At first glance, this finding also appears to
contradict our current knowledge related to tumor‐infiltrating
cytotoxic CD8^+ T cells, which have been shown to suppress the growth
of tumor cells and therefore should be beneficial for patient OS.
Nevertheless, CD8^+ T cells are known to be able to differentiate into
exhausted T cells within the TME of various cancer types, including
HCC. Furthermore, memory CD8^+ T cells in the HCC TME seem to present
with a T‐cell exhaustion phenotype.[241] ^19 Our results showing the
negative association between memory CD8^+ T cells and HCC patient OS
might reflect the suppression of CD8^+ T cell cytotoxicity via the
activation of the immune checkpoint pathway by PD‐L1^+ cells
(Supplementary Figure [242]S16). Interestingly, based on the
predictions made by HCCImm, we also found that a high level of naïve
CD8^+ T cells in the HCC TME seems to be associated with patient
response to cancer treatment (Table [243]2), although the available
data are fairly limited. Overall, in the HCC TME, our results support
the hypothesis that naïve CD8^+ T cells have a positive impact on
patient survival and drug response.
5.3. Correlation of immune cell levels with TMB in TCGA‐LIHC patients
By applying CIBERSORT to quantify 22 immune cell types, McGrail et al.
reported that in TCGA‐LIHC samples, the abundance of CD8^+ T cells was
not positively correlated with neoantigens.[244] ^68 A similar result
was found in our study; however, we additionally assessed the
association between TMB and other immune cell types. When assessing all
369 HCC patients, a significant association of TMB‐H with the abundance
of M2 cells and other TIL cells was found. The same association was
also found in the subgroup of HBV‐HCC patients, although it was absent
when subgroups with different risk histories (i.e., HCV, alcohol, and
no history of risk factors) were assessed. The significant association
of TMB‐H with TIL cells in TCGA‐LIHC patients has also been previously
reported by Wang and Li.[245] ^6
The association between TMB‐H and TIL cells in HBV‐HCC patients may
suggest that during the chronic inflammation caused by the HBV, immune
cells such as HBV‐specific CD8^+ T cells directly attack infected
hepatocytes and subsequently recruit other components of the immune
system, which then leads to immunopathogenesis and further hepatic
damage.[246] ^63 Since there is no such finding for HCV‐HCC patients
and alcohol‐HCC patients, one interpretation is that distinct histories
in terms of risk factors can lead to a range of different
carcinogenesis mechanisms. For example, chronic alcohol consumption
decreases the number of antitumor CD8^+ T cells.[247] ^77
Yan et al.[248] ^77 investigated the abundance of antitumor CD8^+ T
cells in alcohol‐HCC patient samples and found that the advanced TME
typically contains fewer of these cells than the early TME. In chronic
HCC caused by HCV infection, CD8^+ T cells were found to be
exhausted.[249] ^78 Using the HCCImm algorithm to analyze TCGA‐LIHC
data, we found a significant difference in the abundance of naïve CD8^+
T cells between early‐stage (stage I) and advanced‐stage (stage III and
IV) alcohol‐related liver cancer. Specifically, our study revealed a
lower abundance of naïve CD8^+ T cells in the advanced stage,
suggesting a depletion of this cell population as the disease
progresses (Figure [250]S17).
Our results revealed that TMB‐H HCC patients, particularly the subgroup
consisting of HBV‐HCC patients, are likely to have significantly higher
levels of TILs and significantly lower levels of M2 macrophages.
Therefore, HBV‐HCC patients with a high TMB might be a more suitable
target group for ICB therapy than the HCC patient groups that have
other risk histories, such as hepatitis C and chronic alcohol
consumption.
6. CONCLUSION
We built a new CCD tool, HCCImm, to estimate the abundance of sixteen
immune cell types present in HCC gene expression samples. We have
demonstrated its performance by analyzing our results against
experimentally known profiles, including microarray and RNA‐seq
profiles. This benchmark testing showed that HCCImm is a more robust
approach compared to other available CCD methods. By applying this new
tool to analyze TCGA‐LIHC samples and conducting various survival
analyses, we found that there is a significant association between the
abundance of naïve CD8^+ T cells and memory CD8^+ T cells and HCC
patient prognosis. In addition, we also noticed that patients with a
high TMB and a high abundance of naïve CD8^+ T cells had significantly
better prognoses. Moreover, we found that the response of HCC patients
to cancer treatment might be positively associated with the abundance
of naïve CD8^+ T cells. Overall, HCCImm enables a robust and
comprehensive exploration of immune cell composition in the HCC TME,
and our findings are highly consistent with results derived from
experimental investigations, which have generally focused on only a
limited range of immune cell types.
AUTHOR CONTRIBUTIONS
Yen Jung Chiu: Conceptualization (equal); data curation (equal); formal
analysis (equal); investigation (equal); methodology (equal); project
administration (equal); resources (equal); software (equal);
supervision (equal); validation (equal); visualization (equal); writing
– original draft (equal); writing – review and editing (equal).
Chung‐En Ni: Validation (equal). Yen‐Hua Huang: Conceptualization
(equal); funding acquisition (equal); investigation (equal);
methodology (equal); project administration (equal); supervision
(equal); validation (equal); writing – original draft (equal); writing
– review and editing (equal).
FUNDING INFORMATION
This work was supported by grants from the Ministry of Science and
Technology (MOST), Taiwan (MOST108‐2320‐B‐010‐041,
MOST109‐2221‐E‐010‐017‐MY3), and the Higher Education Sprout Project by
the Ministry of Education (MOE), Taiwan (108 AC‐D102).
CONFLICT OF INTEREST STATEMENT
The authors declare that they have no conflicts of interest.
ETHICS APPROVAL AND CONSENT TO PARTICIPATE
Not applicable.
Supporting information
Figure S1–S17.
[251]Click here for additional data file.^ (445.1KB, pdf)
Table S1.
[252]Click here for additional data file.^ (18.1KB, xlsx)
Table S2.
[253]Click here for additional data file.^ (11.6KB, xlsx)
Table S3.
[254]Click here for additional data file.^ (11.8KB, xlsx)
Table S4A–C.
[255]Click here for additional data file.^ (11.5KB, xlsx)
Table S5.
[256]Click here for additional data file.^ (50.8KB, xlsx)
ACKNOWLEDGMENTS