Abstract

Background

   The heterogeneity of colorectal cancer (CRC) and its complex immune
   microenvironment pose significant challenges for treatment.
   Understanding the cellular composition and dynamic changes is essential
   for uncovering mechanisms of tumour progression.

Methods

   To investigate the cellular heterogeneity and immune microenvironment
   of CRC, identifying critical subpopulations, functional pathways, and
   prognostic biomarkers, single-cell transcriptomic data from 41 CRC
   samples across four datasets were integrated. Bioinformatic analyses
   identified cellular subpopulations, cell communication networks, and
   prognostic biomarkers. The expression patterns, clinical significance
   and biological function of TUBB were validated in vitro.

Results

   A distinct epithelial subpopulation with proliferative and invasive
   features was identified, promoting tumour progression by resisting
   apoptosis and remodelling the extracellular matrix. ActMono, a terminal
   state of myeloid cells, was enriched in tumours and linked to disease
   progression. Cell communication analysis highlighted galectin
   signalling in immune regulation. A prognostic model (CRS) based on
   secretory immune cell-related genes identified TUBB as a key molecule
   influencing the cell cycle and extracellular matrix remodelling, with
   its expression patterns, clinical significance and biological effects
   validated in vitro.

Conclusion

   This study reveals critical subpopulations, signalling pathways, and
   biomarkers in CRC, providing insights into tumour progression and
   potential therapeutic strategies.

   Keywords: single-cell sequencing, colorectal cancer, tumour
   microenvironment, cellular heterogeneity, immune regulation, TUBB

1. Introduction

   Colorectal cancer (CRC) ranks as the third most common malignancy
   globally and represents a leading cause of cancer-related mortality.
   Despite surgical resection combined with chemotherapy being the
   standard treatment protocol, approximately one-third of patients
   experience disease recurrence ([35]1, [36]2). While immune checkpoint
   inhibitors have shown significant efficacy in microsatellite
   instability-high (MSI-H) tumours and combined EGFR/BRAF inhibitor
   therapy has proven effective in BRAF V600E-mutant CRC, these treatments
   are only applicable to specific patient subgroups ([37]3–[38]5).
   Large-scale gene expression studies have established molecular
   classification systems for CRC, most notably the Consensus Molecular
   Subtypes (CMS), which categorizes CRC into four subtypes: CMS1–4
   ([39]6). However, these classifications, which are primarily based on
   bulk sequencing data, cannot precisely resolve the complex cellular
   heterogeneity within the tumour microenvironment.

   The development of CRC involves the accumulation of mutations in
   multiple oncogenes and tumour suppressor genes (such as APC, KRAS, and
   PIK3CA) and microsatellite instability caused by DNA mismatch repair
   gene dysfunction ([40]3, [41]7). Although high tumour mutational burden
   (TMB) and MSI status can predict the response to immune checkpoint
   inhibitor therapy, only a minority of patients respond to PD-1
   inhibitor treatment ([42]8, [43]9). The complex molecular heterogeneity
   and microenvironmental characteristics of CRC not only influence
   disease progression but also present significant challenges for
   precision medicine, highlighting the importance of understanding the
   CRC microenvironment in detail. To date, there has not been a
   comprehensive and systematic characterization of how tumour and TME
   cells shape the tumoural, stromal, and immune landscapes to form
   specific CRC subtypes.

   Recent single-cell studies have revealed cellular heterogeneity in the
   CRC microenvironment and identified multiple functionally important
   specific cell subgroups ([44]10–[45]13). While these studies have
   provided new perspectives for understanding tumour progression
   mechanisms and immune evasion, their geographical limitations and
   sample sizes make fully characterizing the shared mechanisms within the
   CRC microenvironment difficult. Cross-study comparisons are also
   challenging due to varying cell annotation methods across different
   studies. In this study, we integrated four datasets from public
   databases, encompassing 41 samples, to systematically describe the
   differential cell population distributions and intercellular
   interaction networks between tumour and normal tissues. We not only
   revealed the heterogeneous characteristics and transcriptional
   reprogramming of epithelial cells in tumour tissues but also identified
   several key cell subgroups potentially involved in the formation of an
   immunosuppressive microenvironment, providing new insights into CRC
   progression mechanisms and the immune microenvironment.

2. Materials and methods

2.1. Data collection

   Data for TCGA-COAD and TCGA-READ were downloaded from the UCSC Xena
   platform ([46]https://xenabrowser.net/datapages/). Single-cell RNA
   sequencing (scRNA-seq) datasets were obtained from the GEO database
   (accession numbers: [47]GSE161277, [48]GSE200997, [49]GSE221575, and
   [50]GSE231559). The combined dataset included samples from 27 primary
   colorectal cancer (CRC) patients and 14 normal control samples. After
   quality control (QC), a total of 88,212 high-quality single cells were
   retained for further analysis.

2.2. Data processing

   Single-cell RNA-seq data were preprocessed using the Seurat v4.3.0 R
   package. Quality control (QC) was performed to remove low-quality cells
   and potential dying cells. Specifically, we retained cells that
   expressed at least 400 genes, and excluded cells with >20%
   mitochondrial gene expression. These thresholds were selected based on
   the distribution of QC metrics and previous studies, aiming to balance
   data completeness with the removal of low-quality cells
   ([51]14–[52]16).

   To detect and remove potential doublets, we applied DoubletFinder
   v2.0.3. The expected number of doublets was calculated based on an
   assumed doublet rate of ~7.5–8%, following 10X Genomics guidelines, and
   using the formula: nExp_poi = round(0.08 × N × N/10000), where N is the
   number of cells in the sample. For doublet prediction, we used 20
   principal components (PCs = 1:20) and the following parameters: pN =
   0.25, pK = 0.09, nExp = nExp_poi, reuse.pANN = FALSE, sct = FALSE.
   These settings were based on the recommended defaults in the official
   DoubletFinder tutorial. Predicted doublets were removed from the
   dataset prior to downstream analysis.

   Data normalization, identification of highly variable genes, principal
   component analysis (PCA), and unsupervised clustering were performed
   using Seurat’s standard pipeline. Harmony v1.2.3 was used for batch
   correction and data integration, with sample identity specified as the
   batch variable. The RunHarmony() function was executed with default
   parameters, and the top 30 PCs were retained for downstream analysis.
   UMAP was used for dimensionality reduction and visualization.

   For differential expression analysis, we used the FindAllMarkers()
   function in Seurat with the Wilcoxon rank-sum test, applying the
   following thresholds: min.pct = 0.25, logfc.threshold = 0.25, only.pos
   = FALSE. Significant differentially expressed genes (DEGs) were defined
   as those with p-value < 0.05.

2.3. Cell type identification

   To identify cell types, we first performed differential expression
   analysis across clusters using the FindAllMarkers() function in Seurat.
   Marker genes were defined as those with an adjusted p-value < 0.05,
   expression in more than 25% of cells within the cluster (min.pct =
   0.25), and an absolute log2 fold change > 0.25. For each cluster, the
   top-ranked differentially expressed genes were considered
   cluster-specific highly expressed genes.

   We then compared these cluster-specific markers against curated
   reference databases, including CellMarker and PanglaoDB, to determine
   the most likely cell type identity for each cluster. Annotation was
   conducted manually based on the expression patterns of canonical
   lineage markers and known cell-type-specific genes.

   To support and cross-validate our manual annotations, we additionally
   employed the SingleR package, which uses reference transcriptomic
   datasets to infer cell identities. The results from SingleR were used
   as a secondary reference and were reconciled with our primary
   marker-based annotation strategy.

   Furthermore, we calculated Spearman correlation coefficients between
   the average expression profiles of all clusters to evaluate
   transcriptional similarity. Clusters with highly correlated expression
   patterns and overlapping marker gene expression were considered for
   subtype merging to avoid artificial over-segmentation. Final cell type
   labels were determined by integrating information from marker gene
   analysis, database matching, SingleR prediction, and inter-cluster
   correlation.

2.4. Cell communication analysis

   Cell-cell communication networks within the tumour microenvironment
   were inferred using the CellChat v1.1.3 R package based on
   receptor-ligand interactions ([53]17). Communication probability and
   the number of interactions were calculated to construct communication
   networks. The interactions between any two cell populations were
   visualized, and scatter plots were generated to display the major
   signalling senders (signal sources) and receivers (targets) in a
   two-dimensional space, helping to identify the main contributors of
   outgoing or incoming signals, particularly among immune cell types. A
   pattern recognition approach was used to identify how multiple immune
   cell types and signalling pathways coordinate.

2.5. Pseudotime trajectory analysis

   Pseudotime trajectories were constructed using the Monocle 2 algorithm,
   an R package designed for single-cell trajectory analysis by Qiu et al.
   ([54]18). This algorithm reduces high-dimensional gene expression
   profiles into a low-dimensional space and arranges the cells into
   trajectories with branching points. Dynamic expression heatmaps were
   constructed using the plot_pseudotime_heatmap function.

2.6. Machine learning-based feature signature identification

   A consensus feature signature was derived using a machine
   learning-based integrative approach that combined ten different
   algorithms: Random Survival Forest (RSF), Elastic Net (Enet), Lasso,
   Ridge, Stepwise Cox, CoxBoost, Partial Least Squares for Cox (plsRcox),
   Supervised Principal Components (SuperPC), Generalized Boosted
   Regression Model (GBM), and Survival Support Vector Machines
   (survival-SVM). To ensure robustness, a consensus model was constructed
   by integrating predictions from these methods.

   A total of 101 algorithmic combinations were executed within a
   leave-one-out cross-validation (LOOCV) framework, optimizing feature
   selection and model performance. Hyperparameters for each algorithm
   were tuned using grid search/random search (or specify your method),
   and model performance was evaluated based on the concordance index
   (C-index) and other relevant metrics (e.g., AUC, log-rank test).The
   TCGA-READ and COAD datasets were randomly split into a training set
   (60%) and a validation set (40%), ensuring a balanced distribution of
   clinical and molecular features. Stratified sampling was applied to
   maintain consistency between groups. The final model was assessed on
   the validation set for predictive accuracy and generalizability.

2.7. Immune infiltration evaluation

   The CIBERSORT algorithm was applied to quantify immune cell
   infiltration levels in pancreatic adenocarcinoma (PAAD) patients and to
   explore differences in immune cell abundance between high-risk and
   low-risk patient groups ([55]19). Pearson correlation analysis was
   conducted to assess the relationship between immune cell abundance and
   risk scores. To further investigate potential differences in immune
   function, single-sample gene set enrichment analysis (ssGSEA) was
   employed to obtain enrichment scores, and the Wilcoxon test was used to
   compare immune function between high-risk and low-risk groups ([56]20).

2.8. Quantitative real-time PCR and western blotting

   Total RNA extracted from paired colorectal cancer tissues was
   collected. Then the RNA was reversely transcribed into cDNA with the
   kit (Takara, Dalian, China) and amplified. The sequences of the primers
   are detailed as below: TUBB: 5’-ATTTCTTTATGCCTGGCTTTG-3’ and
   5’-GACCTGCTGGGTGAGTTCC’; GAPDH: 5’-ACACCCACTCCTCCACCTTT-3’ and
   5’-TTACTCCTTGGAGGCCATGT-3’.

   Total cellular protein from clinical samples were lysed with RIPA lysis
   buffer (Solarbio, China) and added with protease inhibitor at a ratio
   of 1:100 (Thermo Scientific, United States). The information of primary
   antibodies was as follows: TUBB (1:1,000, Abmart, TA7011M) and GAPDH
   (1:4,000, Abmart, P30008S).

2.9. Kaplan–Meier plotter database analysis

   We analysed the predictive value of TUBB in CRC using Kaplan–Meier (KM)
   Plotter ([57]https://kmplot.com) ([58]21). Based on the median
   expression (high and low expression), patients were divided into two
   groups to analyse the overall survival (OS) and recurrence-free
   survival (RFS).

2.10. Patients and clinical samples

   59 paired colorectal cancer tissues were obtained from patients who
   underwent colorectal cancer surgery at Yangpu Hospital of Tongji
   University between November 2018 and November 2019. The study got
   approval from Ethics Committee of the Yangpu Hospital (LL-2023-LW-012).
   CRC tissues and paracancerous tissues were collected during surgery and
   immediately frozen in liquid nitrogen to assess the expression levels
   of specific genes and proteins respectively.

2.11. Cell culture and transfection

   Human colorectal (CRC) cell lines (HCT116 and SW620) were acquired from
   Shanghai Institute of Biochemistry and Cell Biology. All cell lines
   were cultured in DMEM medium (Gibco, Carlsbad, CA, USA) which contains
   10% foetal bovine serum (FBS; Gibco) at 37°C with 5% CO2.

   Lipofectamine 3000 (Invitrogen, Carlsbad, CA, USA) was used to
   transfect cells with an siRNA specific for TUBB and a control construct
   purchased from GeneChem (Shanghai, China). Cells were utilized for
   downstream assays at 48h post-transfection. Analyses were conducted in
   triplicate. TUBB overexpression plasmid was customized from GenePharma
   (Shanghai, China).

2.12. Transwell assays and wound healing assay

   Cells were suspended in 250 μL of serum-free medium and seeded into the
   upper chamber of a 24-well Transwell plate (Nest, China). The lower
   chamber was filled with culture medium containing 10% FBS. For invasion
   assays, the Transwell chambers were coated with Matrigel (2 mg/mL) and
   DMEM, whereas for migration assays, they were left uncoated. After 24
   hours of incubation, the invaded cells were fixed with 4%
   paraformaldehyde for 30 minutes and stained with crystal violet for 10
   additional minutes, both at room temperature. Cells were counted in
   five random optical fields of view under a light microscope (Nikon
   Corporation, Japan).

   For the wound healing assay, cells were cultured without FBS in 6-well
   plates for 24 hours. Linear wounds were created by scratching with a 10
   μL pipette tip. Wound closure was monitored and photographed at 0 and
   24 hours using a microscope (Nikon Corporation, Japan).

2.13. Assays of cell proliferation and apoptosis

   To assess the rate of DNA synthesis, CRC cell lines were subjected to
   treatment with 5-ethynyl-2’-deoxyuridine (EDU) at a concentration of 50
   μM, which was subsequently added to the cell culture plates. Following
   a 30-minute incubation, DNA was stained using Hoechst 33342, allowing
   for the visualization of positively stained cells under a microscope.
   HCT116 and SW620 cells, characterized by either TUBB overexpression or
   knockdown, were dissociated into single-cell suspensions using 0.25%
   trypsin. These cells were then stained with Annexin V-APC and
   7-Aminoactinomycin D (7-AAD) to evaluate apoptosis rates, which were
   quantified through flow cytometry analysis.

2.14. Statistical analysis

   All statistical analyses and data visualizations were performed using R
   software (version 4.1.3). Pearson correlation coefficients were used to
   evaluate the correlation between continuous variables. For quantitative
   data, a two-tailed unpaired Student’s t-test or one-way analysis of
   variance (ANOVA) with Tukey’s multiple comparison test was performed to
   compare values between subgroups. When multiple comparisons were
   conducted, p-values were adjusted using the Benjamini-Hochberg (BH)
   method to control the false discovery rate (FDR). A p-value or adjusted
   p-value < 0.05 was considered statistically significant.

3. Results

3.1. Integrated analysis reveals cell type composition and functional
remodelling in the CRC microenvironment

   In this study, we performed an integrated analysis of 41 samples from
   four public datasets ([59]GSE161277, [60]GSE200997, [61]GSE221575, and
   [62]GSE231559) ([63]2, [64]22–[65]24), comprising 27 primary CRC tumour
   samples and 14 normal control samples. After rigorous quality control
   and doublet removal, we obtained 88,212 valid cells for analysis ([66]
   Supplementary Figures S1A, B ). Through single-cell transcriptome
   analysis, which combines specific gene expression profiles and
   classical cell markers, we classified these cells into 32 distinct
   clusters. On the basis of intercluster Spearman correlation analysis
   and previously reported cell markers ([67]14, [68]25), we identified 8
   different cell types ([69] Figures 1A, B , [70]Supplementary Figure S1C
   ), including T cells expressing high levels of CD3D, CD3E, and TRAC; B
   cells expressing CD79A, MS4A1, and CD79B; epithelial cells expressing
   EPCAM, KRT8, and KRT18; plasma cells expressing JCHAIN and SDC1;
   myeloid cells expressing LYZ, MNDA, and C1QA; fibroblasts expressing
   COL1A1, DCN, and COL1A2; endothelial cells expressing CLDN5, CDH5,
   PECAM1, and VWF; and mast cells expressing TPSB2 and TPSAB1 ([71]
   Figure 1C , [72]Supplementary Figures S1D–G ). We calculated the top 50
   highly expressed genes for each cell type to confirm cluster
   specificity and used AUCell to score activated pathways in each
   subgroup, further validating cell identities ([73] Figure 1D ,
   [74]Supplementary Figure S2A ).

Figure 1.

   [75]Figure 1
   [76]Open in a new tab

   Single-cell transcriptomic analysis reveals cell type composition and
   functional remodelling in colorectal cancer. (A) UMAP dimensionality
   reduction showing the integrated cell distribution map. A total of 32
   cell clusters were identified, classified into 8 major cell types, with
   different colours representing distinct cell clusters. (B) Spearman
   correlation heatmap based on gene expression across cell clusters,
   revealing transcriptional similarities between clusters. (C) Expression
   profiles of characteristic marker genes across different cell types.
   (D) Heatmap displaying the top 50 highly expressed genes specific to
   each cell type, illustrating the transcriptional characteristics of
   different cell types. (E) Density distribution plot of each cell type
   in tumour and normal groups, showing differences in cell abundance
   between groups. (F) PHATE dimensionality reduction plot displaying cell
   distribution, with the left panel coloured by cell type and the right
   panel coloured by sample group, revealing spatial distribution patterns
   of cell types and sample groups. (G) Left panel: Stacked bar plot
   showing the proportional distribution of different cell types across
   samples; right panel: Composition ratio of tumour and normal groups
   within each cell type. * indicates statistical significance at p <
   0.05. (H) Statistical summary of the number of differentially expressed
   genes (DEGs) between tumour and normal groups for each cell type. (I)
   Representative GO functional pathways enriched in upregulated DEGs in
   each cell type.

   For the cell composition analysis, we first calculated the cell type
   abundance in the tumour and normal groups and found significant
   enrichment of epithelial cells in the tumour group ([77] Figure 1E ).
   Using the PHATE algorithm ([78]26), which captures both local and
   global nonlinear structures, we identified significant differences
   between tumour and normal epithelial cells that were not fully revealed
   by conventional UMAP analysis ([79] Figure 1F , [80]Supplementary
   Figure S2B ). The cell proportion statistics revealed significant
   intergroup heterogeneity among the samples. Additionally, the
   proportions of epithelial cells, and myeloid cells increased in the
   tumour group, whereas the proportions of B cells and plasma cells
   decreased, suggesting remodelling of the tumour immune microenvironment
   ([81] Figure 1G , [82]Supplementary Figures S1H, I ), with cell
   abundance calculations providing more intuitive visualization of these
   results ([83] Supplementary Figures S2C, D ). We calculated
   differentially expressed genes (DEGs) between the tumour and normal
   groups for each cell type ([84] Figure 1H ), with epithelial cells
   showing the most DEGs, followed by fibroblasts, indicating that these
   two cell types may undergo the most significant phenotypic changes
   during tumour progression. GO analysis of the upregulated DEGs revealed
   that myeloid cells were significantly enriched in cytokine production
   and apoptotic signalling pathways, whereas T cells were enriched mainly
   in NF-κB signalling and cytokine production-related pathways. Notably,
   multiple cell types respond to oxidative stress and hypoxia,
   particularly endothelial cells and fibroblasts, which are actively
   involved in angiogenesis-related pathways, reflecting the complex
   intercellular interaction network in the tumour microenvironment ([85]
   Figure 1I ).

3.2. Multiple cell types exhibit coordinated patterns of metabolic activation
and matrix remodelling

   We performed a systematic analysis of transcriptional characteristics
   across cell types in tumour and normal tissues. First, through Wilcoxon
   rank-sum test analysis of DEGs, we identified changes in several key
   molecules: the chemokine CCL5 was significantly downregulated in
   multiple immune cells (including T cells and myeloid cells), whereas
   CCL20 was upregulated, suggesting a shift in the immune
   microenvironment from an antitumour status to a protumour status
   ([86]27). Moreover, the concurrent upregulation of COL4A1/COL4A2 across
   multiple cell types, particularly in fibroblasts and endothelial cells,
   reflected extracellular matrix remodelling ([87] Figure 2A ). To
   comprehensively understand the changes in the cellular composition
   during disease states, we employed a multilayered analytical strategy.
   Differential abundance analysis based on k-nearest neighbour statistics
   ([88]28) revealed significantly increased proportions of epithelial
   cells, endothelial cells, and myeloid cells in the tumour group ([89]
   Figures 2B, C ), suggesting expansion of the tumour parenchyma and
   active angiogenesis. Furthermore, using the random forest-based Augur
   algorithm ([90]29) to assess transcriptional perturbation levels across
   cell types, we found that endothelial cells, mast cells, fibroblasts,
   and epithelial cells presented the most significant transcriptional
   changes ([91] Figure 2D ). Using AUCell, we calculated the pathway
   involvement of endothelial cells and mast cells in the tumour group and
   found that endothelial cells were involved primarily in extracellular
   matrix remodelling, whereas mast cells were involved mainly in lipid
   metabolism ([92] Supplementary Figures S2E, F ).

Figure 2.

   [93]Figure 2
   [94]Open in a new tab

   Multi-dimensional analysis reveals transcriptomic remodelling and
   functional changes in cell types in colorectal cancer. (A)
   Differentially expressed genes (DEGs) between tumour and normal groups
   across cell types, analysed using the Wilcoxon rank-sum test. Highly
   significant DEGs are defined as those with an adjusted p-value
   (adj.pval) < 0.05, while lowly significant DEGs are defined as those
   with adj.pval < 0.1. “Upregulated”refers to genes with higher
   expression in the tumour group relative to the normal group. (B)
   Differential abundance analysis using the Milo k-nearest neighbour
   (kNN) algorithm. Each node represents a local cellular neighbourhood,
   with colour intensity representing the log fold change of tumour
   relative to normal. White nodes indicate non-significant differences
   (FDR > 10%). The node layout is based on UMAP dimensionality reduction.
   (C) Statistical results of the Milo kNN differential abundance
   analysis. (D) Augur analysis framework assessing the degree of
   transcriptomic perturbation in each cell subtype between biological
   states. A higher AUC value indicates more significant transcriptomic
   alterations. (E) Volcano plot of DEGs in epithelial cells between
   tumour and normal groups. DC values represent protein-protein
   interaction network strength, calculated using the STRING database.
   (F-G) GO functional enrichment analysis of DEGs in epithelial cells.
   (F) Functional annotation of upregulated genes and (G) downregulated
   genes. (H) GSEA pathway enrichment analysis of DEGs in epithelial
   cells, showing representative signalling pathways. NES represents the
   Normalized Enrichment Score.

   Given the dominant position of epithelial cells in samples and their
   significant abundance and transcriptional changes, we conducted
   in-depth functional analysis. GO enrichment analysis of the DEGs
   revealed that the upregulated genes were enriched mainly in protein
   synthesis- and ribosome biogenesis-related pathways, reflecting robust
   growth demands and stress adaptation responses of tumour cells, whereas
   the downregulated genes were enriched in cell polarity- and
   structure-related pathways, suggesting phenotypic alterations ([95]
   Figures 2E–G ). GSEA further confirmed the significant activation of
   three key pathways: RNA processing, translation, and extracellular
   vesicles. These changes not only reflect cancer cell proliferative
   activity and stress adaptation but also indicate tumour
   microenvironment remodelling and potential metastatic tendencies
   suggesting that systematic changes occur across multiple cell types in
   CRC ([96] Figure 2H ).

3.3. Epithelial cell heterogeneity and copy number variation-driven malignant
progression in CRC

   Considering that epithelial cells are the primary source of malignant
   tumour cells in CRC and their significant perturbations at both the
   abundance and transcriptional levels, we conducted an in-depth analysis
   of epithelial cells. Through reclustering, we identified 8 distinct
   epithelial cell subgroups ([97] Figure 3A , [98]Supplementary Figures
   S3A, B ), including stem/progenitor cells (SPCs), secretory transit
   amplifying cells (SecTA), absorptive enterocytes (AEs), goblet cells
   (GCs), cycling transit amplifying cells (CycTA), infiltrating
   immune-like cells (IIIC), enteroendocrine cells (EECs), and BEST4+
   enterocytes (BEST4-ECs). On the basis of previous studies and analyses
   of highly expressed genes in each subgroup ([99]30, [100]31), we
   validated these subgroup annotations ([101] Figures 3C, D ). By
   calculating subgroup-specific highly expressed genes, we not only
   confirmed cell identity specificity but also identified representative
   marker genes, such as the stem cell characteristic factors SOX4, ELF3,
   and KLF5 in the SPCs and VIM and CCL5 in infiltrating immune-like cells
   ([102] Figure 3D ). Additionally, we performed functional enrichment
   analysis on the genes that were specifically expressed in each subgroup
   to validate their functions ([103] Supplementary Figures S3C, D ). When
   comparing tumour and normal tissues, we detected significantly
   increased abundances of SPCs, SecTA, and CycTA in the tumour group,
   indicating the activation of stem cell properties and inflammatory
   responses, whereas decreased abundances of AEs, EECs, and BEST4-ECs
   reflected impaired normal absorption and endocrine functions, revealing
   significantly different epithelial cell states between the two groups
   ([104]32, [105]33) ([106] Figures 3B, E ). Augur framework analysis
   revealed that EECs and SecTA were the two subgroups with the most
   significant transcriptional perturbations ([107] Figure 3F ). GO
   enrichment analysis of DEGs revealed that EEC-related genes were
   enriched mainly in cytoskeleton- and cell junction-related pathways,
   such as cell migration, cadherin binding, and cell–cell junctions,
   changes that might affect their paracrine regulatory function and
   hormone secretion polarity, potentially leading to invasive
   characteristics. SecTA DEGs were enriched mainly in protein synthesis
   and cell adhesion-related pathways, such as cytoplasmic translation and
   cytosolic ribosomes, reflecting significantly increased protein
   synthesis activity and metabolic levels in SecTA within tumour tissue,
   suggesting possible increased secretory protein production ([108]
   Figure 3G ).

Figure 3.

   [109]Figure 3
   [110]Open in a new tab

   Copy number variation (CNV) analysis reveals genomic instability
   features in epithelial cell subtypes of colorectal cancer. (A) UMAP
   showing subtypes of stem/progenitor cells (SPCs), secretory transit
   amplifying cells (SecTA), absorptive enterocytes (AEs), goblet cells
   (GCs), cycling transit amplifying cells (CycTA), infiltrating
   immune-like cells (IIIC), enteroendocrine cells (EECs), and BEST4+
   enterocytes (BEST4-ECs). (B) Stacked bar plot representing the
   proportional distribution of cell types across different groups. (C)
   Expression of marker genes used for the identification of each cluster.
   (D) Heatmap showing the average expression of genes in each epithelial
   cell subtype, with the left panel displaying clustering of
   subtype-specific genes and representative markers. (E) Density plot
   showing the enrichment of cell counts in each group. (F) Augur
   framework displaying transcriptomic perturbation across subclusters
   under two biological conditions, where a higher AUC represents greater
   transcriptomic changes. (G) GO enrichment analysis of differentially
   expressed genes (DEGs) in EEC and SecTA subtypes. (H) InferCNV analysis
   predicting copy number variations in each epithelial cell subtype using
   T cells, B cells, and NK cells as references. (I) CNV scores mapped