Abstract

Background

   Many studies have utilized computational methods, including cell
   composition deconvolution (CCD), to correlate immune cell polarizations
   with the survival of cancer patients, including those with
   hepatocellular carcinoma (HCC). However, currently available cell
   deconvolution estimated (CDE) tools do not cover the wide range of
   immune cell changes that are known to influence tumor progression.

Results

   A new CCD tool, HCCImm, was designed to estimate the abundance of tumor
   cells and 16 immune cell types in the bulk gene expression profiles of
   HCC samples. HCCImm was validated using real datasets derived from
   human peripheral blood mononuclear cells (PBMCs) and HCC tissue
   samples, demonstrating that HCCImm outperforms other CCD tools. We used
   HCCImm to analyze the bulk RNA‐seq datasets of The Cancer Genome Atlas
   (TCGA)‐liver hepatocellular carcinoma (LIHC) samples. We found that the
   proportions of memory CD8^+ T cells and Tregs were negatively
   associated with patient overall survival (OS). Furthermore, the
   proportion of naïve CD8^+ T cells was positively associated with
   patient OS. In addition, the TCGA‐LIHC samples with a high tumor
   mutational burden had a significantly high abundance of nonmacrophage
   leukocytes.

Conclusions

   HCCImm was equipped with a new set of reference gene expression
   profiles that allowed for a more robust analysis of HCC patient
   expression data. The source code is provided at
   [28]https://github.com/holiday01/HCCImm.

   Keywords: bulk gene expression profiles, CD8 T cell, deconvolution,
   hepatocellular carcinoma, immune cell polarization, therapy response
   subtypes
     __________________________________________________________________

   The HCCImm model, a cell composition deconvolution tool, was developed
   to accurately estimate the abundance of 17 cell types within the
   microenvironment of hepatocellular carcinoma (HCC). The differentially
   expressed genes (DEGs) were used to construct an HCC reference gene
   expression signature matrix. Subsequently, predicted abundance of HCC
   cells were excluded, and the proportions of immune cell types were
   normalized. Survival analysis was performed to investigate the
   relationship between immune cell abundance and patient prognosis. This
   comprehensive workflow enables a detailed exploration of gene
   expression patterns and their impact on patient outcomes. Notably, the
   abundance of CD8 T cells has been identified as a significant factor
   associated with prognosis and the effectiveness of drug treatments.

   graphic file with name CAM4-12-15736-g012.jpg

1. INTRODUCTION

   Primary liver cancer is the seventh most common cancer type and the
   second most common cause of cancer‐related mortality in the world,[29]
   ^1 and hepatocellular carcinoma (HCC) accounts for nearly
   three‐quarters of all liver cancer cases and is the dominant type of
   liver cancer.[30] ^1 Treatment of advanced HCC is challenging, and
   systemic chemotherapy has toxic side effects and no survival
   benefits.[31] ^2 , [32]^3 Recently, immunotherapy, in particular immune
   checkpoint inhibition (ICI), has been used as a clinical treatment for
   HCC. By blocking the cancer cell‐mediated suppression of immune cells,
   they can be reactivated and kill cancer cells. ICI has been approved as
   a first‐line treatment for advanced HCC cases[33] ^4 ; however, only a
   subset of patients seem to benefit from ICI therapy. It has been
   hypothesized that the poor ICI response rate might be related to
   individual differences in the types and relative abundances of immune
   cells in the tumor microenvironment (TME).[34] ^5 In addition, Wang and
   Li suggest that ICIs might not be a particularly favorable method to
   treat HCC because immune checkpoint genes are not highly expressed in
   LIHC patients that have a high tumor mutational burden (TMB),[35] ^6
   which suggests that the characteristics of the HCC TME are different
   from those of other cancers.

   The TME is a complex ecosystem that consists of a range of different
   types of cells; these different cell types include not only cancer
   cells but also numerous interstitial cells and infiltrating immune
   cells. The abundance and density of these tumor‐infiltrating immune
   cells (TIICs)[36] ^7 may be associated with patient survival and cancer
   treatment efficacy. TIICs seem to interact with cancer cells or other
   noncancer interstitial cells in ways that are not yet fully
   understood.[37] ^8 Immune cells might be able to modulate cancer
   progression by regulating the growth and metastasis of tumor cells.
   There are also likely to be interactions between immune cells via
   cytokines or surface proteins, which creates a complex network among
   immune cells.[38] ^9 , [39]^10

   Immune cells are regulated by a variety of cytokines as well as by
   cell–cell contacts, which results in their polarization to perform a
   number of distinct functions. The immune cells present in a given
   sample may have been polarized in a number of different directions, and
   these complex changes impose unexpected influences on the prognosis of
   cancer patients. In recent years, many studies have found that
   different polarizations of the same immune cell type may have diverse
   effects on cancer cells. Some polarized forms are able to inhibit the
   growth of cancer cells or even kill them, while others are able to
   promote the growth and/or metastasis of cancer cells.[40] ^11 For
   example, macrophage cells can be polarized into classically activated
   macrophages (M1), or alternatively, they can become alternatively
   activated macrophages (M2). M2 cells promote the growth of cancer
   cells, while M1 cells inhibit the differentiation of cancer cells.[41]
   ^12 A few studies have reported that a high M1/M2 ratio is associated
   with greater survival of cancer patients.[42] ^13 , [43]^14
   Furthermore, M2 cells can be further divided into four subtypes,
   namely, M2a, M2b, M2c, and M2d, each of which has been suggested to
   exert different effects within the TME. Tumor‐associated macrophages
   (TAMs), which have an immunosuppressive role and protumor properties,
   are referred to as M2d cells.[44] ^15 , [45]^16

   In addition, CD8^+ T cells and their functional subsets have been
   highlighted in many studies due to their roles in cancer
   immunotherapy.[46] ^17 A higher density and a greater abundance of
   CD8^+ T cells in the TME have been associated with a favorable
   prognosis for cancer patients[47] ^18 ; however, in the TME of HCC
   patients, resident memory CD8^+ T cells may differentiate into
   exhausted phenotypes, and a higher abundance of such exhausted memory
   CD8^+ T cells has been associated with the poor prognosis of HCC
   patients.[48] ^19 In contrast, the activated CD8^+ T cells
   differentiated from naïve CD8^+ T cells, compared to those effector
   cells derived from memory CD8^+ T cells, exhibit a better resistance to
   tumor‐induced immune suppression and reveal a more potent
   tumor‐specific cytotoxic activity.[49] ^20

   Exploring which polarized immune cells are significantly associated
   with cancer patient prognosis is likely to allow new insights into the
   application and development of immunotherapy. A useful approach is to
   investigate the immune cell composition of publicly available tumor
   sample datasets. Although experimental methods such as flow cytometry
   and immunohistochemistry staining (IHC) are able to determine the
   abundance of immune cell types, such information is usually unavailable
   in public‐domain datasets.[50] ^21 Therefore, many cell composition
   deconvolution (CCD) methods have been developed to infer the immune
   cell admixture based on the bulk gene expression profile (bulk GEP)
   data of tissues. These CCD methods include linear least‐square
   regression,[51] ^22 , [52]^23 nonnegative matrix factorization,[53] ^24
   , [54]^25 , [55]^26 quadratic programming,[56] ^27 , [57]^28 and
   𝜈‐support vector regression.[58] ^29 Usually, when using a
   deconvolution method, a set of cell‐specific gene expression profiles
   have been curated and used as the reference (reference GEP), and this
   reference GEP is then used to predict the relative abundance of
   different immune cells from the bulk GEP of a given new cancer sample.
   Among these computational deconvolution approaches, CIBERSORTx
   developed by Newman et al. is one of the most widely used methods to
   predict the immune cell composition of a bulk GEP derived from a tumor
   sample.[59] ^29 , [60]^30 Moreover, marker gene‐based approaches, such
   as single sample Gene Set Enrichment Analysis (ssGSEA), are able to
   evaluate the expression values of the genes in a sample. These are
   characteristic for each of the selected cell types and therefore
   facilitate individual abundance estimations.[61] ^21

   When different cell type quantification methods are applied to analyze
   the bulk gene expression profiles of cancer patient tissue samples,
   discrepancies in the results can occur when the predicted associations
   of immune cells with HCC patient survival are explored. For example,
   regulatory T cells (Tregs) are an immunosuppressive subtype of CD4^+ T
   cells and are known to promote tumor growth. Tu. analyzed ICOS^+
   FOXP3^+ Tregs in 57 HCC patient tissue samples by immunohistochemistry
   (IHC) and reported that the infiltration of ICOS^+ FOXP3^+ Tregs showed
   a negative correlation with patient survival.[62] ^31 In contrast to
   this finding, Foerster et al. analyzed the TCGA‐LIHC dataset using a
   marker gene‐based approach and found that a higher abundance of Tregs
   was positively associated with patient survival.[63] ^32 Finally, Peng
   et al. analyzed the same dataset with CIBERSORT and reported that the
   abundance of Tregs was significantly higher in HCC patients with poor
   prognosis.[64] ^33

   These discrepancies among the results of previous studies seem to
   reflect the fact that some of the approaches used have a lower level of
   performance. According to benchmarks created from synthetic bulk gene
   expression profile datasets, deconvolution‐based approaches usually
   outperform marker gene‐based methods[65] ^21 ; however, it can be
   clearly argued that further study in this area is needed. When
   CIBERSORT was applied to predict immune cell compositions using cancer
   patient datasets, a few limitations were noticeable. For example, when
   using CIBERSORT to analyze the TCGA‐LIHC dataset, in as many as 80% of
   the cases, the cell composition predictions were unable to pass the
   Monte Carlo test (p‐value >0.05).[66] ^32 This finding suggests that
   when using CIBERSORT, the association between cell admixture and
   patient survival might be biased because only a small subset of LIHC
   patients was assessed. In addition, when deconvolution‐based methods
   are used, Sturm et al. noticed that nonspecific signature genes can
   cause high background predictions, or “spillover,” for certain cell
   types.[67] ^21 As cancer cells are highly likely to make up the largest
   cell group in a tumor and thus the highest proportion in the TME, we
   believe that there is a concern that, without specifically including
   cancer cells in the reference GEP, a deconvolution‐based approach might
   give rise to additional spillover due to the gene expression signals
   contributed by the cancer cells in the sample.

   Exploring the roles of infiltrating lymphocytes and other immune cells
   in the TME is a popular research topic. Available evidence supports the
   hypothesis that the cell heterogeneity present in the TME seems to have
   an impact on cancer patient survival and response to therapy.[68] ^34 A
   better understanding of the complexity present within the TME is likely
   to provide further insights that will allow the development of new HCC
   therapies. Based on recent studies of human HCC samples using
   single‐cell RNA‐seq,[69] ^35 , [70]^36 there is high cell heterogeneity
   in TME of liver cancer, containing a diversity of immune cell types,
   including dendritic cells (DCs), natural killer (NK) cells, CD4^+ T
   cells, CD8^+ T cells, and TAMs. Zhang et al. also found that Treg and
   exhausted CD8^+ T cells were enriched in human HCC tumor tissues and
   suggested that it is necessary to clarify their association with HCC
   patient survival.[71] ^35

   CCD methods may have limitations due to spillover from the overlapping
   gene expression profiles among different cell types.[72] ^21 Various
   CCD tools, including CIBERSORTx,[73] ^30 EPIC,[74] ^37 TIMER,[75] ^38
   Dtangle,[76] ^39 TumorDecon,[77] ^40 and SCDC,[78] ^41 have been
   developed to predict the abundance of immune cells. For example,
   CIBERSORTx can predict 22 types of immune cells. EPIC can estimate the
   abundance of eight cell types, including immune cells and
   cancer‐associated fibroblasts. TIMER provides immune cell estimations
   for diverse cancer types. Dtangle can estimate 11 types of immune cells
   in Lyme disease. TumorDecon can predict 12 types of immune cells for
   quantitative pharmacology models. SCDC uses an ENSEMBLE approach to
   develop CCD algorithms. However, despite the improved capabilities of
   these tools, spillover from the gene expression profiles of HCC cancer
   cells remains a challenge for CCD methods.

   Our tool, HCC Immune Estimating (HCCImm), has been explicitly designed
   to analyze the immune TME in liver cancer by a unique reference gene
   expression signature matrix (refGES) built from a liver cancer cell
   line to improve the accuracy of cell type deconvolution. As a result,
   HCCImm can more accurately distinguish between immune cells and liver
   cells than other CCD methods.

   Therefore, this study aims not only to create a new CCD method that can
   be applied to analyze the TCGA‐LIHC dataset but also to more diversely
   quantify the polarized populations of immune cells. The inferred immune
   cell compositions could be associated with HCC patient survival. To
   this end, we built our reference gene expression profile by including a
   greater range of immune cell types. For example, we have included more
   differentially polarized forms of macrophages than just the binary
   categorization into M1 and M2. We also expanded the types of CD8^+ T
   cells included and further divided them into two subtypes, namely,
   naïve and memory CD8^+ T cells, as each has specific immune
   characteristics. Cytotoxic T cells expressing the CD8 cell‐surface
   marker are the most powerful effectors associated with the cancer
   immune response,[79] ^42 but in the TME, tissue‐resident memory CD8^+ T
   cells may express various dysfunctional markers, such as PDCD1 and
   CTLA4[80] ^43 ; these markers have also been included. As a comparison,
   CCD methods such as CIBERSORTx and quanTIseq can predict only a single
   fraction value for the CD8^+ T‐cell type.[81] ^29 , [82]^30 , [83]^44

2. STUDY DESIGN

   The workflow overview of this study is presented in Figure [84]1. To
   construct the reference gene signature matrix, which was used for
   regression of the immune cell types in HCC samples, we collected the
   gene expression profiles of 17 distinct types of cells (Figure [85]1A).
   These cell types were an HCC cell line, memory B cell, naïve B cells,
   DC cells, M1 cells induced by interferon γ (IFN γ), M1 cells induced by
   IFN γ, and tumor necrosis factor (TNF), M1 cells induced by
   lipopolysaccharide (LPS), M2 cells induced by IL4 and dexamethasone
   (DEX), M2a cells induced by IL13 or IL4, M2c cells induced by IL10, NK
   cells, naïve CD8^+ T cells, memory CD8^+ T cells, naïve CD4^+ T cells,
   T helper 1 cells (Th1 cells), T helper 2 cells (Th2 cells), and Tregs
   (Table [86]1). The different macrophage polarization forms in the gene
   expression profiles are labeled according to the corresponding
   cytokines used for their in vitro induction.[87] ^45

FIGURE 1.

   FIGURE 1
   [88]Open in a new tab

   The workflow of this study (A) The gene expression profiles of 17 cell
   types were downloaded from NCBI GEO. (B) Creating refGES for 17 types
   of cells. The differentially expressed genes (DEGs) within each of the
   seven groups were ranked by p‐value, and then minimization of the
   condition number was performed to select the top DEGs. (C) HCCImm was
   built using ε‐SVR regression. After removing the predicted fraction of
   HCC cells, the fractions of the remaining 16 types of immune cells were
   normalized to sum 100%. (D) Survival analysis was performed to
   determine whether there was an association between the predicted
   abundance of immune cell types and patient prognosis.

TABLE 1.

   Cell types and replicates used to build the reference gene expression
   signature matrix.
   Cell type      Replicates Cell type        Replicates
   HCC_cell_lines 19         B_memory         18
   M1_IFNγ        9          B_naive          25
   M1_IFNγ_TNF    4          Memory_CD8_Tcell 12
   M1_LPS         6          Naïve_CD8_T      15
   M2_IL4_Dex     15         Naïve_CD4_T      10
   M2a            12         Treg             10
   M2c            8          Th1              7
   DC_cell        13         Th2              9
   NK_cell        12
   [89]Open in a new tab

   To create the refGES for CCD, one idea might be to perform ANOVA and
   t‐test analysis to identify the differentially expressed genes across
   the 17 cell types. However, within the 17 cell types, there are closely
   related subsets, and each cell‐type subset is likely to have highly
   correlated expression profiles. For example, the three types of M1
   cells induced by different cytokines, as listed in Table [90]S1, have
   very similar gene expression profiles, as revealed by cluster analysis.
   If the cell‐specific signature genes were collected by investigating
   the differentially expressed genes (DEGs) in the 17 cell types via
   ANOVA, it is possible that the acquired cell signature genes might not
   be sufficient to distinguish these highly correlated cell types. In
   addition, highly correlated independent variables can cause
   multicollinearity during linear regression. To mitigate these concerns,
   we created our refGES by specifically choosing genes that are
   differentially expressed between the closely related cell subtypes
   (Figure [91]1B).

   For the implementation of the computational code for HCCImm, we used
   the linearSVR program in the Python package “scikit‐learn.” We chose
   ε‐support vector regression (ε‐SVR), subject to an L1 penalty, as the
   core algorithm to perform the deconvolution in this study. Our
   ε‐SVR‐based deconvolution method, HCCImm, was designed in‐house to
   analyze bulk expression profiles from samples with an unknown cell
   composition to predict the proportions of the 16 immune cell types
   present in liver cancer tissue samples.

3. IMPLEMENTATION

3.1. The refGES

   The aim of this study was to quantify the 16 immune cell types
   described above, as well as HCC cells, present in the TME of HCC, and,
   therefore, 196 microarray samples were downloaded from NCBI GEO, and at
   least four replicate samples were included for each of the 17 cell
   types (Figure [92]1A). All of the samples downloaded from NCBI GEO were
   analyzed using the same platform, Affymetrix U133 plus 2 (NCBI GEO
   platform accession: [93]GPL570). The numbers of replicates for the
   different types of cells are listed in Table [94]1, and the NCBI GEO
   sample accession numbers for those microarray samples are provided in
   Table [95]S1.

   The microarray data were quantile normalized using the Robust
   Multiarray Average (RMA) procedure offered in the R package “affy”.[96]
   ^46 ANOVA was performed using the R package “statistics” to identify
   the genes that had significantly higher expression in certain types of
   immune cells compared to that in other cell types. The expression
   values of those differentially expressed genes (DEGs) were used to
   build the initial matrix for the gene expression signatures of the
   immune cells.

   Our approach to construct the refGES consisted of three steps
   (Figure [97]1B). First, in addition to the 17 cell‐type group, we
   further divided the 17 cell types into six groups. Each of the six
   immune cell groups consisted of the cell types that had closely
   correlated gene expression profiles (Table [98]S1). For example, the M1
   group consisted of M1_LPS, M1_IFNγ, and M1_IFNγ_TNF, while the M2 group
   consisted of M2a, M2_IL4_DEX, and M2c.

   Second, to construct the refGES, differentially expressed genes were
   selected from each of the seven cell groups, including the 17‐cell‐type
   group and the six groups of immune cell subsets. In each of the seven
   groups, all of the genes were sorted by their p‐values in ascending
   order. The top k differentially expressed genes were collected from
   each group.

   Third, the median expression values of these selected differentially
   expressed genes from each of the 17 cell types were used to build the
   initial refGES, with the dimension of approximately k genes × 7 groups
   × 17 cell types × iteration #. We set k = 16, but other k values might
   work equally as well. A differentially expressed gene in one of the
   seven groups might be found to be differentially expressed again in one
   or more of the six groups, and therefore, in such circumstances, the
   number of genes to be included to build the refGES could be less than k
   in each iteration for each of the seven cell type groups. Next, we used
   condition numbers to determine how many differentially expressed genes
   should be included in the refGES. Next, we used QIAGEN's Ingenuity^®
   Pathway Analysis (IPA^®, QIAGEN) to identify the diseases and functions
   associated with these genes (Supplementary Figure [99]S1).

   To make the refGES more robust against input variations or noise, we
   used condition numbers as a means to determine how many differentially
   expressed genes should be included in the final matrix. Technically,
   the condition number of a matrix is the product of the norm of the
   matrix and the norm of its inverse. It is a measurement of how
   sensitive a mathematical function might be to changes or errors in the
   input data. In numerical analysis, a function with a low condition
   number is considered to be well‐conditioned and likely to be a
   relatively stable solution when there are small fluctuations in the
   input data. A number of CCD methods have applied the minimization of
   condition numbers to optimize the refGES.[100] ^22 , [101]^27 ,
   [102]^29 Specifically, the R function “kappa” was used to estimate the
   condition number of each matrix. Finally, we created the refGES matrix
   that consisted of the median gene expression values of 201 genes
   expressed in the 17 types of cells, including 16 immune cell types, as
   well as HCC cells (Table [103]S1).

   Each of the cell types listed in Table [104]1 can be further
   categorized into distinct subsets. However, these subsets were not
   inputted as immune cell subtypes when building our refGES. For example,
   in the reference microarray dataset of memory CD8^+ T cells, three
   subsets were noted, including central memory CD8^+ T cells, effector
   memory CD8^+ T cells, and CD45RO^− memory CD8^+ T cells. However, when
   building our refGES, all three memory CD8^+ T subsets of microarray
   data were regarded as a single type of memory CD8^+ T cells. We chose
   to use this kind of approach because we noticed that the gene
   expression patterns of these subsets belonging to the same cell types
   bear high resemblance. When using the refGES consisting of additional
   immune cell types corresponding to these distinct subsets, the
   regression model made poor predictions in benchmark testing using the
   PBMC samples or the samples containing only pure cell types, deviating
   immensely from the true proportions.

   Perhaps a critical reason to prevent us from further dividing the cell
   subsets to include more cell subtypes in our method is due to the
   resolution limits of expression signal deconvolution. One challenge in
   developing CCD methods is the high correlation of the expression
   profiles of closely related cell types.[105] ^47 The gene expression
   signals derived from low‐abundance cell types might be masked by the
   expression of the same gene(s) in a more abundant cell subset.[106] ^9

   Moreover, a remaining issue is whether the constructed refGES can be
   applied to estimate the cell compositions in samples in which there are
   certain subsets of immune cells that do not align perfectly with the
   reference cell types in our refGES. For example, it is unclear whether
   the predicted fraction of memory CD8^+ T cells reflects the true
   compositions in samples containing distinct subsets of memory CD8^+ T
   cells. Such subsets of memory CD8^+ T cells are referred to as
   nontypical memory CD8^+ T (NT‐MCD8T) cells in the following text to
   distinguish them from the reference memory CD8^+ T cells samples used
   to build our refGES for the HCCImm method. Therefore, to address this
   issue, at least partially, an assessment was carried out by using two
   additional datasets, NCBI GEO [107]GSE85074 and [108]GSE26495, in which
   there are microarray data of memory CD8^+ T cells that are either
   PD1^high, PD1^low, effector, or central subtype (Table [109]S2). This
   assessment revealed that the gene expression profiles of these subsets
   of memory CD8^+ T cells are indeed more similar to those of memory
   CD8^+ cells than to those of naïve CD8^+ T cells in our reference
   datasets.

   To estimate the HCC cell fraction in the HCC samples, we decided to use
   the gene expression profiles of Huh7 cells to build our refGES. An
   important reason is that the GEP of the Huh7 cell line, which is a
   human hepatoma‐derived cancer cell line, mimics that of human
   HCC‐derived metabolic genes.[110] ^48 To further assess whether Huh7
   might be superior to other HCC cell lines as a representative of HCC
   tumor cells, we performed a correlation analysis to determine the
   correlation of Huh7 and HepG2 cell lines with human HCC samples. The
   NCBI GEO accession numbers of these microarray data, including 10 Huh7,
   9 HepG2, and 50 human HCC samples, are listed in Table [111]S3. The
   GEPs of liver tumor samples derived from 50 HCC patients (NCBI GEO
   [112]GSE45267) were more similar to those of Huh7 cells than to those
   of HepG2 cells (Supplementary Figure [113]S2).

   For CCD method benchmarking, it is essential to map the cell types of
   the different methods to the same cell‐type ontology. For example,
   CIBERSORTx is designed to estimate the fractions of 22 cell types in a
   cell admixture,[114] ^29 , [115]^30 whereas quanTIseq can quantify only
   10 immune cell types.[116] ^44 Since different methods elucidate cell
   types in more or less detail, it is a common practice when assessing
   CCD methods such as CIBERSORTx to sum up the predicted fractions for
   the cell subsets belonging to the same cell type. Conversely, when CCD
   methods can give only a single fraction value for one cell type but not
   the detailed fractions for distinct cell subsets belonging to that one
   cell type, this predicted fraction corresponds to the sum of the cell
   subset fractions.[117] ^44

3.2. ε‐support vector regression

   One common CCD approach is the use of linear models, where the gene
   expression levels of a mixed sample (x) are modeled as a linear
   combination of the gene expression levels of the individual cell types
   that make up the sample. In this context, bulk GEP (b) refers to the
   gene expression profile of the mixed sample, while refGES (a) refers to
   the reference gene expression profiles of the individual cell types.
   The linear model is used to calculate the proportions of each cell type
   in the mixed sample by solving a system of linear equations.

   Deconvolution can be conceived as finding the solution to the
   convolution equation:
   [MATH: <mrow><mrow><msub><mi>b</mi><mi>i</mi></msub><mo
   linebreak="goodbreak">=</mo><msub><mi>a</mi><mrow><mi>i</mi><mo>,</mo><
   mn mathvariant="italic">1</mn></mrow></msub><msub><mi>x</mi><mn
   mathvariant="italic">1</mn></msub><mo
   linebreak="goodbreak">+</mo><msub><mi>a</mi><mrow><mi>i</mi><mo>,</mo><
   mn mathvariant="italic">2</mn></mrow></msub><msub><mi>x</mi><mn
   mathvariant="italic">2</mn></msub><mo>…</mo><msub><mi>a</mi><mrow><mi>i
   </mi><mo>,</mo><mi>j</mi></mrow></msub><msub><mi>x</mi><mi>j</mi></msub
   ></mrow></mrow> :MATH]

   where b [i ]is the expression level of gene i in a sample of the cell
   mixture, a [i,j ]is the expression level of gene i in cell type j
   (derived from the reference expression signature for this cell type),
   and x [j ]is the unknown proportion of cell type j in the cell mixture.
   Among the available CCD tools, CIBERSORTx uses a 𝜈‐support vector
   regression (SVR)‐based approach, and this tool has outperformed other
   tools in benchmarking experiments.[118] ^29 Since SVR seems to be
   superior to the other regression methods in making predictions, we
   chose one type of SVR, ε‐insensitive support vector regression (ε‐SVR)
   which can provide a sparse solution, as the core algorithm to perform
   deconvolution in this study. Our ε‐SVR‐based method is named HCCImm. To
   implement HCCImm, we used the linearSVR function in the Python package
   “scikit‐learn.” Unlike the 𝜈‐SVR used by CIBERSORTx, ε‐SVR does not
   control the proportion of the support vectors to be used in the final
   model.[119] ^49 As a result, an ε‐SVR‐based deconvolution model might
   have higher flexibility when combining different predictor variables
   without setting a lower bound for the support vectors. After all, it
   would be arbitrary to decide a lower bound for the immune cell types of
   one bulk tissue without prior knowledge of its cell composition. In
   addition, the L1‐loss function was applied to minimize the mean average
   error (MAE) between the predicted and real values.[120] ^50 Compared to
   the L2‐loss function, the L1‐loss function makes the regression model
   more robust to outlier values in the input data. The L2‐loss function
   can produce much larger errors for outliers because this approach takes
   into consideration the squared differences.

3.3. Other linear regression models

   In addition to ε‐SVR, we tested a number of other linear regression
   models, each being subject to one of the three types of regularization.
   The objective function and the loss terms are presented below.
   [MATH: <mrow><mrow><mtext mathvariant="italic">Objective
   function</mtext><mo>:</mo><mi>min</mi><mfrac><mn>100</mn><mi>i</mi></mf
   rac><munderover><mo>∑</mo><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><m
   i>i</mi></munderover><mfenced close="|"
   open="|"><mfrac><mrow><msub><mi>b</mi><mi>n</mi></msub><mo>−</mo><mover
   accent="true"><msub><mi>b</mi><mi>n</mi></msub><mo
   stretchy="true">^</mo></mover></mrow><msub><mi>b</mi><mi>n</mi></msub><
   /mfrac></mfenced><mo linebreak="goodbreak">+</mo><mtext
   mathvariant="italic">regularized term</mtext><mo>,</mo><mtext
   mathvariant="italic">where</mtext></mrow></mrow> :MATH]
   [MATH: <mrow><mrow><mover accent="true"><mi>b</mi><mo
   stretchy="true">^</mo></mover><mo
   linebreak="goodbreak">=</mo><msub><mi>a</mi><mrow><mi>i</mi><mo>,</mo><
   mn>1</mn></mrow></msub><msub><mi>x</mi><mn>1</mn></msub><mo
   linebreak="goodbreak">+</mo><msub><mi>a</mi><mrow><mi>i</mi><mo>,</mo><
   mn>2</mn></mrow></msub><msub><mi>x</mi><mn>2</mn></msub><mo>…</mo><msub
   ><mi>a</mi><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub><msub><mi>
   x</mi><mi>j</mi></msub></mrow></mrow> :MATH]
   [MATH: <mrow><mrow><mi>L</mi><mn>1</mn><mspace
   width="0.25em"></mspace><mtext
   mathvariant="italic">loss</mtext><mo>:</mo><mi>α</mi><mo>∑</mo><mfenced
   close="|" open="|"><mi>x</mi></mfenced></mrow></mrow> :MATH]
   [MATH: <mrow><mrow><mi>L</mi><mn>2</mn><mspace
   width="0.25em"></mspace><mtext
   mathvariant="italic">loss</mtext><mo>:</mo><mi>α</mi><mo>∑</mo><msup><m
   i>x</mi><mn>2</mn></msup></mrow></mrow> :MATH]
   [MATH: <mrow><mrow><mtext mathvariant="italic">Elastic</mtext><mspace
   width="0.25em"></mspace><mi mathvariant="italic">Net</mi><mspace
   width="0.25em"></mspace><mtext
   mathvariant="italic">loss</mtext><mo>:</mo><mi>α</mi><mo>∑</mo><mfenced
   close="|" open="|"><mi>x</mi></mfenced><mo
   linebreak="goodbreak">+</mo><mi>β</mi><mo>∑</mo><msup><mi>x</mi><mn>2</
   mn></msup></mrow></mrow> :MATH]
   [MATH: <mrow><mrow><mtext mathvariant="italic">Subject
   to</mtext><mspace
   width="0.25em"></mspace><mi>x</mi><mo>≥</mo><mn>0</mn></mrow></mrow>
   :MATH]

3.4. Benchmarking of cell deconvolution methods

3.4.1. Analysis of pure cell types

   The benchmarking test to assess the performance of HCCImm initially
   used pure cell types and followed a leave‐one‐out strategy. In each
   test run, one of the 196 microarray samples obtained from NCBI GEO to
   build the refGES for the seventeen cell types was used as the test
   dataset for deconvolution, while the remaining 203 samples were used to
   build the refGES for testing using the aforementioned ANOVA‐based
   procedure. Bar plots were produced to show the distributions of the
   predicted cell‐type fractions in those pure‐cell samples. The results
   were compared with the predictions made by CIBERSORTx and quanTIseq, as
   well as the least absolute shrinkage and selection operator (LASSO),
   L1‐, L2‐, and elastic net‐regularized linear regressions.

3.4.2. Analysis of memory CD8 ^+ T‐cell sample subsets by HCCImm

   Memory CD8^+ T cells have many subsets, including central memory CD8^+
   T cells, effector memory CD8^+ T cells, and others, such as PD1^+ CD8^+
   T cells. To confirm which memory CD8^+ T‐cell subsets HCCImm could
   estimate, two additional CD8^+ T‐cell datasets, [121]GSE85074 and
   [122]GSE26495, were assessed using the model.

3.4.3. Analysis of human PBMCs by flow cytometry

   The performance of HCCImm was also assessed by using real mixtures of
   gene expression profiles derived from PBMCs. First, a set of microarray
   samples of human PBMCs, [123]GSE107990, was downloaded from NCBI GEO;
   this dataset has 164 samples. Since each of these PBMC samples had its
   immune cell composition determined by flow cytometry, we compared the
   cell abundance predicted by HCCImm with the experimentally determined
   composition. The platform used to obtain those microarray samples was
   the Illumina HumanHT‐12 V4.0 expression beadchip and the raw dataset
   was quantile‐normalized using the R package “preprocessCore.” We
   followed a simple cross‐platform quantile normalization approach that
   is described in a previous study.[124] ^51 Mean absolute errors (MAEs)
   and Pearson's correlation coefficients were calculated to compare the
   performances of HCCImm against those of other CCD tools.

   Next, to assess whether HCCImm can be applied to analyze the cell
   composition of RNA‐seq datasets, such as TCGA cancer samples, a set of
   nine RNA‐seq samples derived from the human PBMC dataset [125]GSE107572
   was downloaded from NCBI GEO; these samples had their composition
   confirmed by flow cytometry for eight types of immune cells.[126] ^44
   To perform this benchmark test, the raw counts of the RNA‐seq data were
   transformed by transcripts‐per‐millions (TPM) normalization. Following
   this, the normalized values were further adjusted to make the RNA‐seq
   data have a distribution with a maximum value close to that of the
   microarray datasets used to build our refGES. When using CIBERSORTx and
   quanTIseq, gene expression values in TPM were used, as has been
   suggested by two papers.[127] ^44 , [128]^52 The cell type mapping from
   the predictions made by CIBERSORTx, quanTIseq, and HCCImm is provided
   in Table [129]S4(A).

   A common approach for evaluating the performance of CCD tools involves
   mapping the predicted fractions of immune cell subtypes to higher‐level
   immune cell types.[130] ^21 , [131]^34 This strategy typically starts
   by comparing the predicted fractions of immune cell subtypes with true
   experimentally obtained fractions. In cases where the true fraction is
   only available for a higher‐level immune cell type, but not for its
   constituent subtypes, the predicted fractions of the subtypes are
   usually aggregated to estimate the fraction of the higher‐level immune
   cell type.

   As an example, HCCImm can predict the fractions of two subtypes of
   CD8^+ T cells: naïve CD8^+ T cells and memory CD8^+ T cells. In the
   evaluation using the [132]GSE107990 dataset, the predicted fractions
   for these subtypes were summed to obtain an aggregated fraction value
   for the overall CD8^+ T‐cell population (Table [133]S4A).

   Other CCD tools have been assessed in a similar manner, such as
   CIBERSORT, CIBERSORTx, and quanTIseq.

3.4.4. Analysis of the simulated bulk gene expression profiles of liver
cancer tissue samples

   To evaluate the ability of HCCImm to accurately estimate the cell
   composition of the highly cancerous HCC TME, we conducted in silico
   simulations by spiking the expression signals of 16 immune cell types
   into the microarray data of an HCC cell line. We simulated bulk gene
   expression profiles of liver cancer tissue by randomly selecting one
   HCC cell sample from five datasets (GEO [134]GSE68927, [135]GSE112788,
   [136]GSE128517, [137]GSE75024, or [138]GSE78736). Then the expression
   signals for the 16 immune cell types were randomly sampled from the 185
   immune cell microarray samples, as shown in Table [139]S1.
   [MATH: <mtable><mtr><mtd
   columnalign="left"><mrow><mrow><mi>A</mi><mspace
   width="0.25em"></mspace><mtext
   mathvariant="italic">simulated</mtext><mspace
   width="0.25em"></mspace><mi mathvariant="italic">HCC</mi><mspace
   width="0.25em"></mspace><mtext mathvariant="italic">gene expression
   profile</mtext><mo linebreak="goodbreak">=</mo><msub><mi>P</mi><mi
   mathvariant="italic">HCC</mi></msub><mo>*</mo><mi
   mathvariant="italic">HCC</mi><mspace width="0.25em"></mspace><mi
   mathvariant="italic">GEP</mi></mrow></mrow></mtd></mtr><mtr><mtd
   columnalign="left"><mrow><mo
   linebreak="goodbreak">+</mo><mo>∑</mo><msub><mi>P</mi><mi>j</mi></msub>
   <mo>*</mo><msub><mi
   mathvariant="italic">GEP</mi><mi>j</mi></msub><mo>,</mo><mspace
   width="0.5em"></mspace><mtext mathvariant="italic">where</mtext><mspace
   width="0.25em"></mspace><msub><mi>P</mi><mi
   mathvariant="italic">HCC</mi></msub><mo
   linebreak="goodbreak">+</mo><mo>∑</mo><msub><mi>P</mi><mi>j</mi></msub>
   <mo linebreak="goodbreak">=</mo><mn>1</mn></mrow></mtd></mtr></mtable>
   :MATH]

   j represents the immune cell type. Each simulated profile was generated
   by summing the GEPs of 17 cell types and weighting them according to
   their respective proportions which were randomly assigned.

   Specifically, we generated 100 simulated HCC tissue profiles for each
   of the six different cancer‐cell fractions (P [HCC ]) ‐ 15%, 30%, 45%,
   60%, 75%, and 90%. Therefore, in each set of 100 simulated profiles,
   each profile had a predetermined P [HCC ]fraction of gene expression
   signals derived from a randomly selected HCC cell microarray sample,
   and the remaining fraction (100% ‐ P [HCC ]) was composed of expression
   signals of 16 immune cell microarray samples which were randomly
   selected and weighted.

   In each test run, deconvolution was performed by taking the remaining
   microarray samples of HCC cells and immune cells (excluding the one HCC
   cell sample and the 16 immune cell samples used to simulate this bulk
   profile with spike‐ins) to build the refGES that was to be used for
   testing. When assessing the predictions, a normalization was performed
   such that the fraction making up the HCC cell line was discarded
   (Figure [140]1C), and the fractions of the 16 types of immune cells
   were summed to 100%.

   To investigate the agreement of the predictions made by the various CCD
   methods, including ours, Pearson's correlation coefficient and absolute
   error (AE) were calculated. Boxplots were produced to show the
   distributions of AEs, and the results were compared with the AEs of the
   predictions made by other methods. The cell type mapping to the ground
   truths from the predictions made by CIBERSORTx, quanTIseq, and HCCImm
   is provided in Table [141]S4(B,C).

3.5. Analysis of TCGA‐LIHC samples

   To predict the cell composition of HCC samples and to perform survival
   analysis of the patients, 371 RNA‐seq samples from the TCGA‐LIHC
   dataset[142] ^53 were downloaded from the NCI Genomic Data Commons (NCI
   GDC, [143]https://gdc.cancer.gov/) (Figure [144]1C,D). In addition, the
   tumor mutation data of the TCGA‐LIHC samples were obtained from the MC3
   project
   ([145]https://gdc.cancer.gov/about‐data/publications/mc3‐2017).[146]
   ^54 Finally, the drug response data of the TCGA‐LIHC patients, curated
   by Ding et al., were downloaded from
   [147]http://lifeome.net/supp/drug_response/.[148] ^55

4. RESULTS

4.1. Building the refGES

   Using the approaches mentioned in the Methods section, including
   prioritizing the DEGs in the cell subsets, and using the condition
   number calculation, a set of 201 DEGs was selected to build our refGES
   (Figure [149]1B, Supplementary Figure [150]S3). Thus, the dimension of
   the final refGES matrix used in this study was 201 genes × 17 cell
   types. The refGES matrix is provided in Table [151]S5.

   These 201 genes as a whole are described as the signature genes (SGs)
   in the following text. To reveal the clustering of the 196 microarray
   samples, these SGs were used to generate the t‐SNE (t‐distributed
   stochastic neighbor embedding) plot,[152] ^56 and from this plot, it is
   clear that the samples form a number of distinct clusters, each
   corresponding to a single cell type (Figure [153]2). In contrast, when
   using the t‐SNE plot generated with all of the ~20,000 gene expression
   values as the sample features, some clusters consisted of samples that
   belong to different cell types (Supplementary Figure [154]S4). For
   example, clusters made up of heterogeneous cell types were found for
   the DC samples, the M2c samples, the M1_LPS samples, and the M2a
   samples. Furthermore, memory CD8^+ T‐cell samples and naïve CD8^+
   T‐cell samples overlapped. Finally, the NK, Th1, Th2, and naïve CD4^+
   T‐cell samples did not form clear‐cut clusters (Supplementary
   Figure [155]S4). This result suggests that the expression values of
   these SGs can be used to distinguish the 17 types of cells that make up
   the samples.

FIGURE 2.

   FIGURE 2
   [156]Open in a new tab

   The t‐SNE plot of the 196 microarray samples using the cell type
   signature genes (SG) as the sample features. Each data point represents
   one sample and is color‐coded according to its cell type.

   We performed pathway enrichment analysis to explore the functions of
   the SGs and found that half of them were implicated in four types of
   functions/diseases involving the liver, namely, liver damage, necrosis
   of the liver, inflammation of the liver, and liver lesions
   (Supplementary Figure [157]S1). In contrast, the genes in the
   CIBERSORTx LM22 are only associated with one type of liver disease.

4.2. Concordance between the predicted cell composition and the true
proportions

4.2.1. Pure‐cell samples

   HCCImm was then assessed by gene expression profile analysis against
   the known proportions of immune cells. First, each of the 196
   microarray samples was tested using the leave‐one‐out strategy as
   mentioned in the Materials and Methods. Since each microarray sample
   was derived from only a single type of immune cell, each of these
   microarray samples was regarded as a pure‐cell sample, and the true
   cell composition for each of the pure‐cell samples was 100% for its
   specific type of cell.

   In this benchmark test, the prediction made by HCCImm was able to
   identify the composition of the major cell type in the pure‐cell
   samples. The predictions made by HCCImm were consistent with the
   original data for the seventeen types of cells. HCCImm, as well as the
   regularized regression approaches using exactly the same refGES built
   in this study, consistently predicted the dominant cell type with a
   median fraction of ~80% and short error bars, with the predicted
   fractions for other cell types being generally lower than 10%
   (Figure [158]3). In contrast, the median values for the cell fractions
   predicted by the two most widely used tools, namely, quanTIseq and
   CIBERSORTx, were lower than 60%, and there were larger variances.

FIGURE 3.

   FIGURE 3
   [159]Open in a new tab

   The bar plots for the benchmark using pure‐cell samples. Each graph
   corresponds to the predicted fraction of the specific cell type of one
   of the 196 pure‐cell samples. The performance of HCCImm, L1‐ and
   L2‐regularized linear regression, LASSO, quanTIseq, and CIBERSORTx are
   shown.

4.2.2. Memory CD8 ^+ T samples

   To evaluate whether our refGES is suitable for performing cell
   deconvolution analysis against the GEPs containing heterogeneous
   subsets of memory CD8^+ T cells, an independent test was performed by
   using two additional NCBI GEO datasets that were not used to build our
   refGES. These two datasets consist of a series of GEPs, where each
   corresponds to a distinct subset of memory CD8^+ T cells (NT‐MCD8T, see
   Materials and Methods, 3.1). Therefore, this assessment can also be
   conceived as an extended test using GEPs consisting of pure‐cell
   samples.

   First, by carrying out correlation analysis, we noticed that the GEPs
   of these NT‐MCD8T data were more similar to those of the memory CD8^+ T
   cells (MCD8T) used to build our refGES than to those of the naïve CD8^+
   T cells (Supplementary Figure [160]S5). Next, we predicted the cell
   compositions of these NT‐MCD8T data using a linear regression approach,
   LASSO. We found that LASSO regression consistently predicted the
   dominant cell type as MCD8T, with a high median fraction of ~74% and a
   narrow interquartile range (IQR), and the sum of the predicted
   fractions for other cell types was generally lower than ~34%
   (Supplementary Figure [161]S6). The LASSO regression result suggests
   that even though some subsets of memory CD8^+ T cells may not perfectly
   match the features of MCD8T cells used in the refGES, our refGES is
   suitable for analyzing GEPs that contain such memory CD8^+ T‐cell
   subsets.

   We used two sets of DEGs for further analysis. One DEG set was prepared
   by comparing the GEPs of memory CD8^+ T cells and naïve CD8^+ T cells.
   The other DEG set was prepared by comparing the GEPs of PD1‐high CD8^+
   T cells (for the GEO accession, see Table [162]S2) and naïve CD8^+ T
   cells. The significantly enriched pathways in both DEG sets are
   involved in the regulation of exhausted CD8^+ T cells (Supplementary
   Figure [163]S7), including calcium signaling,[164] ^57 mitochondrial
   dysfunction,[165] ^58 the sirtuin signaling pathway[166] ^59 and
   glucocorticoid receptor signaling.[167] ^60

4.2.3. PBMC microarray and RNA‐seq samples

   Unlike the previous evaluation of our method, where only pure‐cell
   samples were used, the next set of benchmark testing was based on real
   human samples consisting of multiple cell types, namely, human PBMCs
   with linked flow cytometry results. The purpose of this test was to
   evaluate whether HCCImm still performs well when the samples are
   derived from real human tissues. Importantly, however, in these
   samples, the cell types that had been quantified were not completely
   consistent with those that could be predicted by HCCImm and by the
   other CCD tools. For example, in the [168]GSE107990 dataset, only the
   fraction of the precursor of macrophages, namely, monocytes, had been
   determined, but not the different polarization forms of macrophages.
   Another example is that only CD4^+ T cells were experimentally
   determined for this PBMC microarray dataset, whereas HCCImm is able to
   predict the fractions of five subtypes of CD4^+ T cells, including
   naïve CD4^+ T, Treg, Th1, Th2, and memory CD4^+ T cells. To compare the
   predicted fractions with the original differential cell type data
   during this benchmark, the fractions predicted by HCCImm for various
   polarization forms of macrophage M1 and M2 cells were summed to give an
   estimate of the total monocyte fraction in the samples. The full table
   mapping the predicted cell types against the experimental cell type
   dataset is provided in Table [169]S4(A).

   The results show that HCCImm was superior to the other regularized
   linear regression approaches tested in this study, as well as publicly
   available tools such as quanTIseq and CIBERSORTx. Specifically, our
   approach has a significantly higher median value for the correlation
   coefficients (~0.86) between the predictions and the experimental PBMC
   microarray datasets (Figure [170]4).

FIGURE 4.

   FIGURE 4
   [171]Open in a new tab

   Boxplots of the correlation coefficients of the predictions against the
   human PBMC microarray sample dataset [172]GSE107990. Elastic, L1, and
   L2 correspond to the linear regression methods subject to elastic net,
   L1‐regularization, and L2‐regularization, respectively. (A) Boxplots of
   the Pearson's correlation coefficients for HCCImm, L1‐, L2‐, and
   elastic net‐regularized methods. (B) Boxplots of Pearson's correlation
   coefficients for HCCImm, CIBERSORTx, and quanTIseq.

   However, when the nine RNA‐seq samples from human PBMCs with known cell
   fractions were used for benchmarking, it could be seen that the
   predictions made by quanTIseq, a method originally designed to analyze
   RNA‐seq data, had a high correlation with the experimental cell
   fractions. In this benchmark test, the performance of HCCImm was better
   than that of quanTIseq, with the median values of the correlation
   coefficients being ~0.88 and ~ 0.85, respectively, whereas higher
   variances in the correlation coefficients can be seen for the
   predictions made by CIBERSORTx. The predictions made by HCCImm have
   lower mean square errors (MSEs) than those of the other two methods
   (Figure [173]5A). The median error of the predictions made by HCCImm
   was closer to zero (Figure [174]5B). The assessment suggests that both
   quanTIseq and our ε‐SVR‐based method, HCCImm, outperform CIBERSORTx
   when real human RNA‐seq samples are used. When using the expression
   signals derived from multiple cell types, the quanTIseq approach could
   quantify only ten types of immune cells. Although the refGES used in
   this study was originally built to utilize microarray samples, this
   benchmark test supports that HCCImm can also be applied to analyze
   RNA‐seq samples.

FIGURE 5.

   FIGURE 5
   [175]Open in a new tab

   (A) Corrplot of Pearson's correlation coefficients and (B) boxplots of
   the prediction errors using the experimental cell fractions from the
   human PBMC RNA‐seq dataset.

4.3. Simulated bulk tissues with different fractions of HCC cells

   Next, an association analysis between the immune cell compositions and
   the survival of HCC patients, as well as other cancer features, was
   performed. This part of the study was designed to take advantage of the
   gene expression data of the TCGA‐LIHC dataset along with the available
   clinical data, including survival time. A benchmark test to reveal the
   performance of HCCImm when applied to the bulk expression profiles of
   HCC samples was developed. However, bulk GEPs are seldom accompanied by
   experimentally determined cell fractions. Therefore, we simulated the
   bulk expression profiles of the HCC samples to assess our CCD method.

   Since the dominant cell type in the TME is most likely tumor cells,
   they should make up a high proportion of the bulk GEP. Therefore, we
   created synthetic datasets that contained a fixed percentage (x%) of
   gene expression signals derived from HCC cells, and the remaining
   fraction (100% ‐ x%) was composed of signals from the sixteen types of
   immune cells that were randomly sampled and weighted. The results
   revealed that HCCImm, our ε‐SVR‐based method, was quite robust, showing
   consistently high correlations between the predicted cell fractions and
   predefined fractions of tumor cells (Figure [176]6). Although the
   predictions made by quanTIseq had higher correlations with original
   data, when the fractions of HCC cells in the simulated datasets were
   higher than 75% (Figure [177]6), the predictions made by HCCImm had
   significantly smaller AEs than those of the predictions made by
   quanTIseq and CIBERSORTx (Figure [178]7A–F).

FIGURE 6.

   FIGURE 6
   [179]Open in a new tab

   Boxplots of Pearson's correlation coefficients of the stimulated
   dataset predictions compared to the original data. The plots were
   generated using the simulated bulk HCC samples for each benchmark based
   on the gene expression signals with a fixed fraction of HCC cells,
   namely, 15%, 30%, 45%, 60%, 75%, and 90%. Each data point in each
   boxplot represents a simulated HCC sample.

FIGURE 7.

   FIGURE 7
   [180]Open in a new tab

   Boxplots of the predicted cell fraction absolute errors (AEs). A–F: The
   boxplots correspond to each benchmark based on the gene expression
   signals with a fixed HCC cell fraction, namely, 15%, 30%, 45%, 60%,
   75%, and 90%, in the simulated bulk tissues. G,H: The boxplots
   correspond to the benchmark tests using the microarray and RNA‐seq
   samples of PBMCs, respectively.

   Furthermore, to evaluate the effect of refGES on the cell deconvolution
   model, we utilized our refGES instead of the default LM22 matrix to
   carry out CIBERSORTx analysis, and this test was labeled
   CIBERSORTx.ourref. Based on the correlation between the
   CIBERSORTx.ourref result and the original data of the simulated
   datasets, we determined that its performance was better than that of
   CIBERSORTx with default LM22, albeit still not as robust as HCCImm.

   However, for the previous benchmark using pure‐type samples, the
   performances of L1‐regularized, L2‐regularized, and elastic
   net‐regularized linear regressions were slightly better than that of
   HCCImm (Figure [181]3). This benchmark using simulated HCC samples
   supports the hypothesis that our ε‐SVR‐based method, HCCImm, together
   with the new refGES, is likely to be more suitable for analyzing HCC
   bulk tissue samples that contain multiple cell types.

4.4. Association of immune cell abundance with HCC patient survival and TMB

   The ultimate goal of this study was to explore whether there is an
   association between immune cell abundance and the clinical features of
   HCC patients, such as survival time and TMB. Among the TCGA‐LIHC
   patients, 369 patients had complete survival information, and these
   patients were extracted for cell fraction prediction
   (Figure [182]1C,D). The raw counts of the RNA‐seq samples of these
   patients were converted to TPM values and transformed as mentioned in
   the “Methods” section.

   The strategy used for the association analysis between the immune cell
   fractions and patient survival is as follows. First, the cell fractions
   of the 369 TCGA‐LIHC RNA‐seq samples were predicted. The patients were
   then categorized into four groups based on their risk histories
   (Figure [183]8). The association of the cell fractions with the OS of
   the HCC patients was then assessed by the log‐rank test and univariate
   Cox regression analyses (Figure [184]1D). To consider the potential
   bias when performing cross‐sample comparisons of the abundance of each
   immune cell type, we decided that the predicted HCC cell fractions
   should be removed and the remaining fractions of the sixteen types of
   immune cells in each sample were then normalized to 100%
   (Figure [185]1C). We argue that the potential variations in the
   resection procedure during tumor sample acquisition may have affected
   the RNA‐seq results, such that the proportion of gene expression
   signals derived from HCC cells across different samples. For the
   respective association analysis between immune cell abundance and
   patient prognosis, the Kaplan–Meier method was used to create survival
   curves. For each immune cell type, the 369 HCC patients were divided
   into high and low groups. The high group represented the patients with
   a predicted fraction of one immune cell type higher than the median
   value, and vice versa. By using this approach, it was found that the
   abundance of three immune cell types was significantly associated with
   HCC patient OS (Figure [186]9). HCC patients with higher fractions of
   memory CD8^+ T cells and Treg cells had a worse prognosis (log‐rank
   test p‐values of 0.055 and 0.037, respectively). In contrast, HCC
   patients with a higher fraction of naïve CD8^+ T cells were found to
   have a significantly better prognosis (log‐rank test p‐value: 0.013).
   In addition, for subgroups of HCC patients carrying risk factors,
   including HCV and alcohol consumption, a higher fraction of Tregs and a
   higher fraction of memory CD8^+ T cells were associated with a worse
   prognosis (Supplementary Figure [187]S8).

FIGURE 8.

   FIGURE 8
   [188]Open in a new tab

   TCGA‐LIHC patients were categorized into four risk groups.

FIGURE 9.

   FIGURE 9
   [189]Open in a new tab

   Kaplan–Meier survival curves of HCC patients were created by dividing
   patients into high and low groups corresponding to high and low
   abundances of naïve CD8^+ T cells, Treg cells, and memory CD8^+ T
   cells, A, B, and C, respectively. The red lines indicate the high
   group, while the green lines indicate the low group.

   The Kaplan–Meier analysis indicated that the higher abundance group of
   naïve CD8^+ T cells was a significant factor for the survival of HCC
   patients (Figure [190]9). Two subtypes of CD8^+ T cells are associated
   with HCC patient prognosis. Patients were divided into four groups
   based on their proportions of naïve and memory CD8^+ T cells: higher
   proportions of naïve CD8^+ T cells and higher proportions of memory
   CD8^+ T cells (naïve_h&mem_h), higher proportions of naïve CD8^+ T
   cells and lower proportions of memory CD8^+ T cells (naïve_h&mem_l),
   lower proportions of naïve CD8^+ T cells and higher proportions of
   memory CD8^+ T cells (naïve_l&mem_h), and lower proportions of naïve
   CD8^+ T cells and lower proportions of memory CD8^+ T cells
   (naïve_l&mem_l). The Kaplan–Meier plot analysis revealed the
   association between these CD8^+ T‐cell subtypes and patient prognosis.
   The results showed that patients with a higher proportion of naïve
   CD8^+ T cells had a better prognosis, regardless of the memory CD8^+ T
   cell proportion. This finding suggests that the proportion of naïve
   CD8^+ T cells could serve as a valuable prognostic marker for HCC
   patients (Supplementary Figure [191]S9).

   To further investigate which immune cell types might be critical to HCC
   patient survival, univariate Cox regression analysis (Cox, 1972) was
   performed to assess whether there was a correlation between immune cell
   fractions and HCC patient OS. The hazard ratios (HRs) and 95%
   confidence intervals (CIs) of the sixteen immune cell types are shown
   in Figure [192]10. These results revealed that two types of immune
   cells, M2a cells, and naïve CD8^+ T cells, seem to have a positive
   impact on HCC patient OS. In contrast, memory CD8^+ T cells seem to
   have a negative impact on HCC patient OS (p‐value = 0.07). Furthermore,
   higher abundances of Treg cells and naïve B cells were found to be
   significantly associated with a worse OS (Treg: HR = 1.07, 95%
   CI = 1.02–1.12, p‐value = 0.01) (B_naïve: HR = 1.16, 95%
   CI = 1.00–1.35, p‐value = 0.05).

FIGURE 10.

   FIGURE 10
   [193]Open in a new tab

   Cox regression analysis between immune cell abundance and the OS of 369
   HCC patients using the proportions of sixteen immune cell types as the
   independent variable.

   Moreover, progression‐free survival (PFS), in addition to OS, can also
   provide valuable information, which shows how long a patient's cancer
   remains stable or does not worsen after the initial diagnosis.[194] ^61
   The results revealed that in the TCGA‐LIHC patients, the proportions of
   naïve CD8^+ T cells and Tregs were significantly associated with PFS.
   Naïve CD8^+ T cells seem to have a positive impact on HCC patient PFS,
   whereas Treg cells are likely to have a negative impact on HCC patient
   PFS, suggesting that these immune cell subsets may play a role in
   cancer progression and the response to treatment (Supplementary
   Figure [195]S10).

   We further categorized the HCC patients into distinct risk groups based
   on factors such as chronic hepatitis and alcohol consumption. Each risk
   group is likely to represent a distinct type of underlying condition
   related to chronic inflammation. For example, hepatitis B virus (HBV)
   carriers often suffer from chronic inflammation of the liver due to
   recurrent reactivation of their HBV infection,[196] ^62 and
   HBV‐specific CD8^+ T cells then infiltrate into their liver and acquire
   a specific phenotype.[197] ^63 Moreover, alcohol consumption might lead
   to a proinflammatory effect involving Kupffer cells, which has been
   shown to reduce the recruitment of CD8^+ T cells.[198] ^64 A number of
   associations between immune cell abundance and the various risk groups
   have been identified. For example, the HR of Treg cells in the OS of
   HBV‐HCC patients is 1.28, which is higher than that of all other HCC
   patient groups. Interestingly, unlike the survival of HBV‐HCC patients,
   the abundance of Treg cells does not appear to be significantly
   associated with the survival of HCC patients in the other risk groups
   (Figure [199]11). We noticed that in addition to the HBV group, the
   survival of the hepatitis C virus (HCV) group might be associated with
   the abundance of Tregs. A Cox hazard analysis for PFS in the HCV‐HCC
   group showed a HR = 1.12 and p‐value <0.05 for Treg abundance
   (Supplementary Figure [200]S10). We reviewed the literature and noticed
   that this association between Treg cell abundance and viral risk group
   survival might reflect the immunosuppressive role of Treg cells in the
   progression of chronic viral hepatitis, which might not occur in
   non‐viral HCC risk groups. In hepatitis viral infections, Treg cells
   play a protective role in the host by suppressing immune‐mediated
   mechanisms of liver damage.[201] ^65 Another study also provided
   evidence that HBV infection‐related factors could lead to the expansion
   of Tregs and enhance their suppressor function, which in turn could
   dampen the antitumor immune response to HCC tumor antigens and hinder
   the immune surveillance of HCC tumors.[202] ^66

FIGURE 11.

   FIGURE 11
   [203]Open in a new tab

   Cox regression analysis of immune cell abundance in HBV‐HCC patients
   using the fraction of 16 immune cell types as the independent variable.

   Another association was found between the M2 polarization subtypes, M2a
   and M2_IL4_DEX, and the OS of HCC patients with HBV and alcohol
   consumption risk factors. The HBV‐HCC patients and the alcohol‐HCC
   patients who had higher fractions of M2a (Figure [204]11) and
   M2_IL4_DEX (Figure [205]12), respectively, were found to have HRs
   significantly lower than 1.0. In contrast, the immune cell types
   associated with increased HRs among alcohol‐HCC patients are different
   from those associated with increased HRs among HBV‐HCC patients.
   Alcohol‐HCC patients with a higher abundance of memory CD8^+ T cells
   and naïve CD4^+ T cells have a worse prognosis (Figure [206]12).
   However, HCV‐HCC and HCC patients without other risk factors did not
   have a significant hazed ratio (Supplementary Figures [207]S11 and
   S12).

FIGURE 12.

   FIGURE 12
   [208]Open in a new tab

   Cox regression analysis of the immune cell abundance of alcohol‐HCC
   patients using the fraction of 16 immune cell types as the independent
   variable. Note: *p < 0.05.

4.5. Association between immune cell fraction and TMB

   Cancer immunotherapy has changed the treatment landscape for cancer
   patients. However, in terms of the treatment of HCC, only a small
   subset of patients respond well to immunotherapy.[209] ^67 Therefore, a
   better understanding of the interaction between HCC cells and immune
   cells in the TME should provide insights into possible prognostic and
   predictive biomarkers that can then be used to select patients who
   might benefit from ICI‐based treatment.[210] ^67 A high TMB has been
   proposed as a biomarker to predict the response to immune checkpoint
   blockade (ICB) based on the hypothesis that increased tumor mutations
   lead to the production of more antigenic peptides, which enhances
   immunogenicity.[211] ^68 In contrast, available evidence suggests that
   an association between TMB and immune signatures is dependent on the
   cancer type, and a high TMB has been associated with a poor prognosis
   among TCGA‐LIHC patients (Supplementary Figure [212]S13).[213] ^6
   Therefore, we investigated whether the TMB level of HCC patients was
   correlated with the levels of certain immune cell types that might be
   associated with immunosuppression in the HCC TME.

   We used HCCImm to investigate the association between TMB and the
   abundance of leukocytes in TCGA‐LIHC samples. The TCGA mutation calls
   were acquired from the MC3 calls in the Pancancer Atlas
   ([214]https://gdc.cancer.gov/about‐data/publications/pancanatlas).[215]
   ^54 We noticed that when the LIHC samples had a high TMB (TMB‐H), there
   were significantly lower fractions of the M2 cell subtypes (t‐test,
   p‐value = 0.017), and this was linked to a significantly higher sum for
   all of the fractions of the other leukocytes, specifically excluding
   macrophages (t‐test, p‐value = 0.001). An association of TMB‐H with M2
   cell levels and other tumor‐infiltrating lymphocyte (TIL) cell levels
   was also observed in the HBV‐HCC patient subgroup. In the TMB‐H group,
   the subset of patients with a higher fraction of Treg cells had a worse
   prognosis (log‐rank test p‐value: 0.039) (Figure [216]13A), whereas the
   subset of patients with a higher fraction of naïve CD8^+ T cells had a
   significantly better prognosis (log‐rank test p‐value: 0.044)
   (Figure [217]13B).

FIGURE 13.

   FIGURE 13
   [218]Open in a new tab

   Kaplan–Meier survival curves of HCC patients with high TMB were
   obtained by dividing the patients into two groups based on their level
   of (A) Treg cells and (B) naïve CD8^+ T cells.

   Moreover, to investigate the potential impact of Treg abundance on the
   relationship between HCC patient OS and their TMB levels, we generated
   waterfall plots separately for high‐ and low‐Treg groups (Supplementary
   Figure [219]S14). The y‐axis of the plots represents the estimated
   survival scale, which was calculated using a min‐max scaling method
   based on patients' survival time. These plots revealed that, within the
   first 2 years after diagnosis, HCC patients with high TMB and Treg‐high
   status experienced a significantly shorter overall survival time
   (survival scale y < 0.2, Fisher's exact test, p‐value of 0.017,
   Supplementary Figure [220]S14A). In contrast, in Treg‐low status, the
   result shows that the TMB levels were not associated with survival
   (Supplementary Figure [221]S14B). This finding provides further support
   for the notion that Treg abundance may play a role in the progression
   of HCC in patients with high TMB levels.

4.6. Immune cell levels and anticancer drug response among TCGA‐LIHC patients

   We explored the association between immune cell abundance and patient
   response to treatment. Although data on the responses to ICB therapy
   are not available for TCGA‐LIHC patients, a few records about the
   responses to cancer therapy have been documented in TCGA clinical data.
   Previously, Ding et al. curated drug treatment information of TCGA
   patients (Supplementary Figure [222]S15), wherein 180 clinical response
   records related to four chemotherapeutic drugs were obtained to
   determine the molecular features that can predict outcomes.[223] ^54
   Here, we took advantage of these records to assess the predictive
   utility of immune cell levels with respect to clinical drug responses
   among TCGA‐LIHC patients. Interestingly, all four of the LIHC patients
   that showed complete or partial responses to treatment had higher
   levels of DCs and naïve CD8^+ T cells (Fisher's exact test,
   p‐value = 0.057) (Table [224]2A). When we focused only on the HCC
   patient subgroup that had a high level of memory CD8^+ T cells or
   Tregs, a high abundance of naïve CD8^+ T cells was significantly
   associated with a greater treatment response (Fisher's exact test,
   p‐value = 0.042) (Table [225]2B).

TABLE 2.

   The 2 × 2 tables for TCGA‐LIHC patients with drug response records have
   high and low levels of naïve CD8^+ T cells. (A) All LIHC patients. (B)
   The HCC patient subgroup with a high level of memory CD8^+ T cells or
   Tregs.
   A
                      Responder No treatment records
   High naïve CD8^+ T 4         169
   Low naïve CD8^+ T  0         180
   B
                      Responder No treatment records
   High naïve CD8^+ T 3         70
   Low naïve CD8^+ T  0         135
   [226]Open in a new tab

5. DISCUSSION

5.1. Bulk RNA‐seq and differential polarizations

   Our ε‐SVR‐based method, HCCImm, was designed to use the bulk GEPs of
   HCC samples to predict the abundance of different major immune cell
   types such as B cells, CD8^+ T cells, CD4^+ T cells, and macrophages.
   The benchmark testing revealed that HCCImm had at least a comparable
   performance to other immune cell quantification methods when used to
   analyze bulk RNA‐seq samples. Furthermore, HCCImm was also able to make
   predictions for more cell types than earlier methods, such as
   quanTIseq, which was specifically designed to analyze RNA‐seq data. In
   addition, when analyzing the TCGA‐LIHC dataset, for >95% (355/369) of
   the cases, our predictions for cell composition passed the Monte Carlo
   test (p‐value <0.05). Therefore, HCCImm was able to predict the
   fractions of immune cell subtypes, which were then correlated with HCC
   patient OS.

5.2. Correlation between OS, treatment response, and immune cell composition

   One important issue concerns the association between TAMs and HCC
   patient OS. TAMs with the M2 phenotype have been proposed to be key
   players in cancer‐related inflammation and are involved in promoting
   the survival, growth, and metastasis of cancer cells, as well as the
   suppression of adaptive immunity.[227] ^69 However, in previous
   computational studies investigating immune cell composition, there have
   been no consistent findings with respect to an association between the
   level of macrophage polarization and HCC patient OS. For example,
   Manoharan et al. reported that a high level of monocytes, but not
   macrophages, was negatively associated with TCGA‐LIHC patient OS.[228]
   ^70 Foerster et al. stated that the absence of macrophages and Th2
   cells were positively correlated with TCGA‐LIHC patient OS. Peng et al.
   found that TCGA‐LIHC patients in the poor prognosis group had higher
   levels of M0 and M2 cells.[229] ^33

   Using HCCImm, we have been able to predict the cell fractions in the
   TCGA‐LIHC samples; this analysis included multiple polarized forms of
   macrophages, not just the binary M1 and M2 cell categories. After the
   analysis was carried out, we found no significant association with
   patient OS. Hourani et al.[230] ^71 reported that the subtypes of M2
   macrophages that are associated with poor cancer prognosis differ
   across various cancer types. In liver cancer, M2 cells have been
   commonly associated with poor prognosis. However, it remains unclear
   how the subtypes of M2 cells specifically impact different types of
   liver cancer. When we further investigated the subgroups of HCC
   patients with distinct risk histories, the HBV‐HCC patients and
   alcohol‐HCC patients with higher levels of M2a (Figure [231]11) and
   M2_IL4_DEX (Figure [232]12), respectively, were found to have HRs
   significantly lower than 1.0. This result suggests that not all subsets
   of M2 cells are associated with a poor HCC patient prognosis.

   This result seems to contradict the common viewpoint that TAMs with an
   M2 phenotype are associated with a poor prognosis among HCC patients.
   Perhaps because we have been unable to find a negative association of
   any macrophage subtype with HCC patient OS, the results might reflect
   the possibility that the selected macrophage subtypes in this study do
   not correspond well to the survival‐associated subsets of macrophages.
   Since TAMs are composed of heterogeneous subsets of functionally
   distinct macrophages, perhaps only a subset of TAMs might have an
   important association with patient survival, which is supported by
   recently published research using scRNA‐seq to investigate immune cells
   in the HCC TME.[233] ^72 Nonetheless, our results also indicate that
   some subsets of macrophages may improve the prognosis of several HCC
   patient subgroups. Notably, several meta‐analyses have also suggested
   that TAMs are not necessarily associated with poor cancer patient
   prognosis.[234] ^73 , [235]^74 , [236]^75

   CD8^+ T cells are another type of immune cell that is critical for
   tumor control and is known to be associated with a favorable HCC
   prognosis.[237] ^72 , [238]^76 Our analyses using the log‐rank test and
   Cox univariate regression revealed that HCC patients with a higher
   fraction of naïve CD8^+ T cells had a favorable prognosis. In contrast,
   the abundance of another CD8^+ T‐cell subset, memory CD8^+ T cells,
   seems to be associated with a HR significantly larger than 1.0
   (Figure [239]10). This result suggests that in the HCC TME, there is a
   positive association between memory CD8^+ T cells and patient OS, which
   is similar to that found for Treg immunosuppressive cells
   (Figure [240]10). At first glance, this finding also appears to
   contradict our current knowledge related to tumor‐infiltrating
   cytotoxic CD8^+ T cells, which have been shown to suppress the growth
   of tumor cells and therefore should be beneficial for patient OS.
   Nevertheless, CD8^+ T cells are known to be able to differentiate into
   exhausted T cells within the TME of various cancer types, including
   HCC. Furthermore, memory CD8^+ T cells in the HCC TME seem to present
   with a T‐cell exhaustion phenotype.[241] ^19 Our results showing the
   negative association between memory CD8^+ T cells and HCC patient OS
   might reflect the suppression of CD8^+ T cell cytotoxicity via the
   activation of the immune checkpoint pathway by PD‐L1^+ cells
   (Supplementary Figure [242]S16). Interestingly, based on the
   predictions made by HCCImm, we also found that a high level of naïve
   CD8^+ T cells in the HCC TME seems to be associated with patient
   response to cancer treatment (Table [243]2), although the available
   data are fairly limited. Overall, in the HCC TME, our results support
   the hypothesis that naïve CD8^+ T cells have a positive impact on
   patient survival and drug response.

5.3. Correlation of immune cell levels with TMB in TCGA‐LIHC patients

   By applying CIBERSORT to quantify 22 immune cell types, McGrail et al.
   reported that in TCGA‐LIHC samples, the abundance of CD8^+ T cells was
   not positively correlated with neoantigens.[244] ^68 A similar result
   was found in our study; however, we additionally assessed the
   association between TMB and other immune cell types. When assessing all
   369 HCC patients, a significant association of TMB‐H with the abundance
   of M2 cells and other TIL cells was found. The same association was
   also found in the subgroup of HBV‐HCC patients, although it was absent
   when subgroups with different risk histories (i.e., HCV, alcohol, and
   no history of risk factors) were assessed. The significant association
   of TMB‐H with TIL cells in TCGA‐LIHC patients has also been previously
   reported by Wang and Li.[245] ^6

   The association between TMB‐H and TIL cells in HBV‐HCC patients may
   suggest that during the chronic inflammation caused by the HBV, immune
   cells such as HBV‐specific CD8^+ T cells directly attack infected
   hepatocytes and subsequently recruit other components of the immune
   system, which then leads to immunopathogenesis and further hepatic
   damage.[246] ^63 Since there is no such finding for HCV‐HCC patients
   and alcohol‐HCC patients, one interpretation is that distinct histories
   in terms of risk factors can lead to a range of different
   carcinogenesis mechanisms. For example, chronic alcohol consumption
   decreases the number of antitumor CD8^+ T cells.[247] ^77

   Yan et al.[248] ^77 investigated the abundance of antitumor CD8^+ T
   cells in alcohol‐HCC patient samples and found that the advanced TME
   typically contains fewer of these cells than the early TME. In chronic
   HCC caused by HCV infection, CD8^+ T cells were found to be
   exhausted.[249] ^78 Using the HCCImm algorithm to analyze TCGA‐LIHC
   data, we found a significant difference in the abundance of naïve CD8^+
   T cells between early‐stage (stage I) and advanced‐stage (stage III and
   IV) alcohol‐related liver cancer. Specifically, our study revealed a
   lower abundance of naïve CD8^+ T cells in the advanced stage,
   suggesting a depletion of this cell population as the disease
   progresses (Figure [250]S17).

   Our results revealed that TMB‐H HCC patients, particularly the subgroup
   consisting of HBV‐HCC patients, are likely to have significantly higher
   levels of TILs and significantly lower levels of M2 macrophages.
   Therefore, HBV‐HCC patients with a high TMB might be a more suitable
   target group for ICB therapy than the HCC patient groups that have
   other risk histories, such as hepatitis C and chronic alcohol
   consumption.

6. CONCLUSION

   We built a new CCD tool, HCCImm, to estimate the abundance of sixteen
   immune cell types present in HCC gene expression samples. We have
   demonstrated its performance by analyzing our results against
   experimentally known profiles, including microarray and RNA‐seq
   profiles. This benchmark testing showed that HCCImm is a more robust
   approach compared to other available CCD methods. By applying this new
   tool to analyze TCGA‐LIHC samples and conducting various survival
   analyses, we found that there is a significant association between the
   abundance of naïve CD8^+ T cells and memory CD8^+ T cells and HCC
   patient prognosis. In addition, we also noticed that patients with a
   high TMB and a high abundance of naïve CD8^+ T cells had significantly
   better prognoses. Moreover, we found that the response of HCC patients
   to cancer treatment might be positively associated with the abundance
   of naïve CD8^+ T cells. Overall, HCCImm enables a robust and
   comprehensive exploration of immune cell composition in the HCC TME,
   and our findings are highly consistent with results derived from
   experimental investigations, which have generally focused on only a
   limited range of immune cell types.

AUTHOR CONTRIBUTIONS

   Yen Jung Chiu: Conceptualization (equal); data curation (equal); formal
   analysis (equal); investigation (equal); methodology (equal); project
   administration (equal); resources (equal); software (equal);
   supervision (equal); validation (equal); visualization (equal); writing
   – original draft (equal); writing – review and editing (equal).
   Chung‐En Ni: Validation (equal). Yen‐Hua Huang: Conceptualization
   (equal); funding acquisition (equal); investigation (equal);
   methodology (equal); project administration (equal); supervision
   (equal); validation (equal); writing – original draft (equal); writing
   – review and editing (equal).

FUNDING INFORMATION

   This work was supported by grants from the Ministry of Science and
   Technology (MOST), Taiwan (MOST108‐2320‐B‐010‐041,
   MOST109‐2221‐E‐010‐017‐MY3), and the Higher Education Sprout Project by
   the Ministry of Education (MOE), Taiwan (108 AC‐D102).

CONFLICT OF INTEREST STATEMENT

   The authors declare that they have no conflicts of interest.

ETHICS APPROVAL AND CONSENT TO PARTICIPATE

   Not applicable.

Supporting information

   Figure S1–S17.
   [251]Click here for additional data file.^ (445.1KB, pdf)

   Table S1.
   [252]Click here for additional data file.^ (18.1KB, xlsx)

   Table S2.
   [253]Click here for additional data file.^ (11.6KB, xlsx)

   Table S3.
   [254]Click here for additional data file.^ (11.8KB, xlsx)

   Table S4A–C.
   [255]Click here for additional data file.^ (11.5KB, xlsx)

   Table S5.
   [256]Click here for additional data file.^ (50.8KB, xlsx)

ACKNOWLEDGMENTS