Abstract Lung adenocarcinoma (LUAD) is one of the sole causes of death in lung cancer patients. This study combined with single-cell RNA-seq analysis to identify tumor stem-related prognostic models to predict the prognosis of lung adenocarcinoma, chemotherapy agents, and immunotherapy efficacy. mRNA expression-based stemness index (mRNAsi) was determined by One Class Linear Regression (OCLR). Differentially expressed genes (DEGs) were detected by limma package. Single-cell RNA-seq analysis in [36]GSE123902 dataset was performed using Seurat package. Weighted Co-Expression Network Analysis (WGCNA) was built by rms package. Cell differentiation ability was determined by CytoTRACE. Cell communication analysis was performed by CellCall and CellChat package. Prognosis model was constructed by 10 machine learning and 101 combinations. Drug predictive analysis was conducted by pRRophetic package. Immune microenvironment landscape was determined by ESTIMATE, MCP-Counter, ssGSEA analysis. Tumor samples have higher mRNAsi, and the high mRNAsi group presents a worse prognosis. Turquoise module was highly correlated with mRNAsi in TCGA-LUAD dataset. scRNA analysis showed that 22 epithelial cell clusters were obtained, and higher CSCs malignant epithelial cells have more complex cellular communication with other cells and presented dedifferentiation phenomenon. Cellular senescence and Hippo signaling pathway are the major difference pathways between high- and low CSCs malignant epithelial cells. The pseudo-temporal analysis shows that cluster1, 2, high CSC epithelial cells, are concentrated at the end of the differentiation trajectory. Finally, 13 genes were obtained by intersecting genes in turquoise module, Top200 genes in hdWGCNA, DEGs in high- and low- mRNAsi group as well as DEGs in tumor samples vs. normal group. Among 101 prognostic models, average c-index (0.71) was highest in CoxBoost + RSF model. The high-risk group samples had immunosuppressive status, higher tumor malignancy and low benefit from immunotherapy. This work found that malignant tumors and malignant epithelial cells have high CSC characteristics, and identified a model that could predict the prognosis, immune microenvironment, and immunotherapy of LUAD, based on CSC-related genes. These results provided reference value for the clinical diagnosis and treatment of LUAD. Keywords: mRNAsi, Lung adenocarcinoma, Single-cell RNA-seq analysis, Immune microenvironment, WGCNA Subject terms: Cancer, Computational biology and bioinformatics Introduction Lung cancer is a malignant tumor that originates in the mucous membranes or glands of the bronchus and is the leading cause of cancer-related death^[37]1. According to the latest data from the Global Cancer Survey 2020, there were 2,206,771 new cases of lung cancer and 1,796,144 deaths worldwide in 2020, making it the second most common cancer and the leading cause of cancer death^[38]2. Lung cancer consists of non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), of which non-small cell lung cancer accounts for about 80–85% of lung cancer cases^[39]3, lung adenocarcinoma (LUAD) is the most common pathological subtype, accounting for about 50% of non-small cell lung cancer. It is characterized by complex mechanisms, strong aggressiveness, and poor prognosis^[40]4,[41]5. In the last decade, there have been new advances in the treatment of LUAD, including surgical treatment, radiotherapy, chemotherapy and targeted combination therapy. However, due to the occult nature of the disease, most of the patients with LUAD were diagnosed at an advanced stage and could not be treated with surgery. However, after the use of other drugs for radiotherapy and chemotherapy, the prognosis of LUAD patients is still poor, and the 5-year survival rate is less than 20%^[42]6. Therefore, exploring the prognostic markers of LUAD has become the top priority of current scientific research. Recent studies have shown that tumor growth may be driven by a small group of cells called cancer stem cells (CSCs). These cells may generate tumor host cells through self-renewal and multidirection differentiation, maintain tumor growth and heterogeneity, and are also called cancer initiating cells. CSCs have a pioneering immunosuppressive effect at the time of tumorigenesis, and gradually lose this ability during differentiation into astrocytes and oligodendrocytes. In addition, CSCs are believed to be extremely resistant to treatment, leading to multiple treatment failures, including immunotherapy. mRNA expression-based stemness index (mRNAsi), the stemness index of the transcriptome calculated by the OCLR algorithm, could be used to evaluate stemness. Higher mRNAsi scores, as reflected by histopathological grades, are associated with active biological processes in CSCs and with more differentiated tumors^[43]7. Single-cell RNA sequencing (scRNA-seq) research has increasingly focused on the natural progression of cancer. A mouse model of esophageal squamous cell carcinoma (SQUamous cell carcinoma) induced by chemical carcinogen 4-nitroxylin 1-oxide (4-NQ0) was constructed to simulate the animal model of human esophageal carcinoma. The evolutionary trajectory of esophageal epithelial carcinoma from normal and precancerous lesions to invasive carcinoma was described in detail by single-cell transcriptome sequencing^[44]8. Single-cell analysis of precancerous lesion samples from gastric, pancreatic, and colorectal cancer showed that precancerous cells were also highly heterogeneous, with significant dynamic changes in cell composition and expression program^[45]9,[46]10. This study attempted to use single cell RNA analysis to identify malignant epithelial cells with high CSC, search for genes related to CSC, and construct prognostic models, hoping to predict patients' prognosis, immune status, and immunotherapy strategies. Results Difference analysis of mRNAsi in transcriptome datasets As described in method, mRNAsi of samples in TCGA-LUAD dataset were calculated, and higher mRNAsi in tumor samples were observed than that in normal samples (Fig. [47]1A). There were 12,525 upregulated genes and 6970 downregulated genes in Tumor vs. Normal (Fig. [48]1B). Tumor samples in TCGA-LUAD dataset were divided into high mRNAsi group (167 cases) and low mRNAsi group (333 cases) based on mRNAsi median value (0.4742294). we also found 9184 genes with increased expressions and 10,307 genes with decreased expressions in high group in comparison to low group (Fig. [49]1C). GO analysis in 19,491 DEGs showed regulation of hormone levels, axonogenesis, cell–substrate junction, DNA-binding transcription factor binding were enriched (Fig. [50]S1A). KEGG analysis enriched to MAPK signaling pathway, human papillomavirus infection, neuroactive ligand-receptor interaction pathways (Fig. [51]S1B). Figure 1. [52]Figure 1 [53]Open in a new tab mRNAsi differences in the transcriptome of lung adenocarcinoma. (A) mRNAsi was higher in tumor samples than that in normal samples in TCGA dataset. (B) Volcano map of differentially expressed genes between tumor samples and normal samples in TCGA dataset. (C) Volcano map of differentially expressed genes between high- and low- mRNAsi tumor groups in TCGA dataset. (D–F) The survival times in high mRNAsi group was shorted than that in low mRNAsi group in TCGA dataset, [54]GSE31210 dataset and [55]GSE50081 dataset. (G–I) Differences in clinical features of high and low mRNAsi groups in TCGA dataset, [56]GSE31210 dataset and [57]GSE50081 dataset. Moreover, samples in high mRNAsi group had a less survival times than that in low mRNAsi group in TCGA-LUAD dataset (p = 0.042) (Fig. [58]1D), [59]GSE31210 dataset (p = 0.009) (Fig. [60]1E), [61]GSE50081 dataset (p = 0.043) (Fig. [62]1F). Differences analysis of clinical features between high- and low-mRNAsi group showed Gender, T.Stage, M.Stage and Stage had significantly various in TCGA-LUAD dataset (Fig. [63]1G), Stage and OS in [64]GSE31210 dataset (Fig. [65]1H). but there were no differences in [66]GSE50081 dataset (Fig. [67]1I). WGCNA To further screen mRNAsi related genes, WGCNA analysis was performed using mRNAsi score in TCGA-LUAD dataset. When soft threshold = 4 (Fig. [68]S2A), 6 genes modules were determined (Fig. [69]S2B). Correlation analysis between mRNAsi and 6 modules showed turquoise module was higher associated to mRNAsi (cor = 0.81, p = 3e−86) (Fig. [70]2A). A positive phenomenon was observed between gene significance for mRNAsi and module membership in turquoise module (cor = 0.81, p = 1e−200) (Fig. [71]2B). GO and KEGG analysis showed that the turquoise module genes were mainly enriched into many biological processes related to cell proliferation, such as DNA replication, mitosis, and organelle repair (Fig. [72]2C). Figure 2. [73]Figure 2 [74]Open in a new tab Weighted Co-Expression Network Analysis (WGCNA). (A) The module-trait relationships between mRNAsi and 6 modules. (B) Correlation analysis between gene significance for mRNAsi and module membership in turquoise module. (C) GO and KEGG analysis of genes in turquoise module. Single cell analysis of CSCs Single cells in [75]GSE123902 dataset were performed for dimension reduction and annotation analysis, and 8 cell subtypes were obtained (Fig. [76]3A). To identify the malignant tumor components in epithelial cells, we extracted epithelial cell subtypes for infercnv analysis, in which only cluster4,16 of epithelial cells were normal epithelial cells (Fig. [77]3B). To further clarify the stem differences in malignant epithelial cells, the malignant epithelial cells were extracted for CytoTRACE analysis, and high CSCs malignant epithelial cells and low CSCs malignant epithelial cells were defined based on the median CytoTRACE score (Fig. [78]3C). Cellcghat analysis indicated that high CSCs malignant epithelial cells had a more complex cellular communication with other cells (Fig. [79]3D). Pseudo-time series analysis was further used to explore the developmental trajectories of the high CSCs malignant epithelial cells and low CSCs malignant epithelial cells, and the results showed that malignant tumor cells developed from low CSCs to high CSCs malignant, indicating that there is a biological process of dedifferentiation with the progression of tumors (Fig. [80]3E,F). Figure 3. [81]Figure 3 [82]Open in a new tab scRNA analysis in [83]GSE123902 dataset. (A) 8 type cells were annotated in [84]GSE123902 dataset. (B) Identification of malignant components in epithelial cells using Infercnv package. (C) Cytotrace package identified the high and low cancer stemness cell groups in malignant epithelial cells. (D) Cell communication analysis among high and low CSC malignant epithelial cells with other immune cells. (E,F) Pseudotemporal analysis of high and low CSC malignant epithelial cells. Key subtypes were identified by WGCNA CellCall analysis on low CSCs malignant and high CSCs malignant vs. other cell subtypes implied that Cellular senescence pathway and Hippo signaling pathway were main difference pathways (Fig. [85]4A). Moreover, cluster1, 2 only existed in high CSCs malignant group (Fig. [86]4B). WGCNA analysis indicated that bule module enriched in cluster1, 2 (Fig. [87]4C,D). The pseudo-time series analysis showed that cluster1 and cluster2 were concentrated at the end of differentiation trajectory (Fig. [88]4E,F), which was consistent with our previous studies and further verified the biological characteristics of dry dedifferentiation of LUAD. cluster1 and cluster2 were defined as High epi group, and MIF-(CD74 + CD44) were increased in High epi group (Fig. [89]4G). Figure 4. [90]Figure 4 [91]Open in a new tab Hub cluster in CSC malignant epithelial cells through hdWGCNA. (A) CellCall analysis determined pathway differences between high and low CSC malignant epithelial cells. (B) The distribution of 12 clusters in high and low CSC malignant epithelial cells. (C,D) hdWGCNA found that blue modules were significantly enriched in cluster1 and custer2. (E,F) Pseudotemporal analysis of cluster1 and custer2. (G) Cellchat analysis showed communication differences of high CSC malignant epithelial cells. Construction of prognosis model based on machine learning 13 key genes were obtained by intersection of top200 genes in hdWGCNA, genes in turquoise module, DEGs in high vs. low mRNAsi group and DEGs in tumor vs. normal samples (Fig. [92]5A). In TCGA-LUAD dataset, 101 prognosis models were detected by LOOCV frame and c index of 101 models were calculated in TCGA-LUAD dataset, [93]GSE50081 dataset, [94]GSE3210 dataset. Among which, average c index was highest (0.701) of CoxBoost + RFS model (Fig. [95]5B). 6 hub genes (SUB1, POLD2, ELOVL6, TNNT1, PPIA, IRX2) were screened. KM survival analysis demonstrated that samples in high group had a less survival time in TCGA-LUAD dataset, [96]GSE50081 dataset, [97]GSE3210 dataset (Fig. [98]5C–E). In addition, based on single-cell data, hub gene positioning was further defined, and the results showed that 6 genes were significantly highly expressed in high CSCs malignant epithelial cells (Fig. [99]5F). Figure 5. [100]Figure 5 [101]Open in a new tab Construction of prognosis model. (A) Venn diagram of differentially expressed genes. (B) 101 prognostic prediction models were built by machine learning constructs. (C–E) KM survival curve of prognosis model in TCGA dataset, [102]GSE31210 dataset and [103]GSE50081 dataset. (F) Hub gene localization in single cell subpopulation. Immune microenvironment landscape analysis basis on prognosis model ESTIMATE analysis showed that ImmunScore, StromalScore and ESTIMATEScore were enhanced in high group that those in low group (Fig. [104]6A–C). PurityScore was decreased in high group, indicating a higher tumor malignancy (Fig. [105]6D). Subsequently, CIBROSORT, EPIC, MCP-counter and TIMER analyses also verified that there was significant immunosuppression in the high group (Fig. [106]6E). Figure 6. [107]Figure 6 [108]Open in a new tab Immune microenvironment analysis. (A) ImmuneScore differences between high- and low-risk group. (B) StromalScore differences between high- and low-risk group. (C) PurityScore differences between high- and low-risk group. (D) ESTIMATEScore differences between high- and low-risk group. (E) CIBROSORT, EPIC, MCP-counter, TIMER analysis between high- and low- risk group. Relationship between 6 hub genes and immunity, pathways immunity, pathway The ESTIMATE and MCP-counter methods were used to evaluate the immune scores of samples from [109]GSE75214 dataset, and the ssGSEA method was used to evaluate the scores of 28 immune cells corresponding to each sample. Next, the Pearson correlations between 6 hub genes and these immune scores were calculated and visualized, among which, except SUB1 and IRX2, other 4 genes were negatively correlated with major immune killer cells (Fig. [110]7A). The Pearson correlations between 11 pathways scores and 6 hub genes indicated that all genes were negatively to APICAL_JUNCTION pathway (Fig. [111]7B). Figure 7. [112]Figure 7 [113]Open in a new tab Correlation analysis between hub genes and immunity/pathways. (A) Correlation analysis between hub genes and immunity in TCGA dataset. (B) Correlation analysis between hub genes and pathways in TCGA dataset. Predictive analysis of chemotherapy drugs and immunotherapy In TCGA-LUAD dataset, drug sensitivity prediction analysis of prognostic model showed low-risk group was benefit from AS601245, Nilotinib, AZD6482, AP.24534 (Fig. [114]8A–D). The high-risk group showed better sensitivity to Docetaxel, JNK.9L, Bortezomib, and Paclitaxel (Fig. [115]8E–H), which provided a direction for later treatment. In IMvigor210 dataset, patients in the high-risk group treated with PD-L1 had a worse prognosis (p = 0.0023, Fig. [116]8I). RiskScore of SD/PD samples were higher than that in CR/PR samples (Fig. [117]8J). High-risk group had more PD/SD samples (Fig. [118]8K). In stage III-IV patients, the high-risk group had a worse prognosis (p = 0.0016, Fig. [119]8L). Figure 8. [120]Figure 8 [121]Open in a new tab Prognostic model to predict the efficacy of chemotherapy drugs and immunotherapy. (A–H) IC50 differences of chemotherapy drugs between high- and low- risk groups. (I) The survival times in high-risk group was worse in IMvigor210 dataset. (J) Differences in RiskScore between CR/PR and PD/SD responses in the IMvigor210 cohort. (K) Distribution of immunotherapy response between high- and low- risk groups in the IMvigor210 cohort. (L) The high-risk group of III-IV patients had a worse prognosis. Discussion Today, CSCs are seen as drivers of tumor establishment and growth and are often associated with aggressive, heterogeneous, and treatment-resistant tumors^[122]11–[123]14. In colon cancer, recent studies in mice have shown that even differentiated intestinal epithelial cells may act as potential CSCS^[124]15. Epithelial cell adhesion molecules (EpCAM, CD326) are expressed on CSCS of multiple tumor types, including colon and liver cancer^[125]16,[126]17. CSC is found in almost all solid tumors^[127]18. Motivated by these observations, In LUAD, we hypothesize that CSC is associated with malignant epithelial cells. Using transcriptome data of LUAD, it was found that tumor samples had higher mRNAsi, and samples in the high-dry group had worse prognosis. WGCNA analysis showed that turquoise modules were highly correlated with mRNAsi and were associated with biological processes such as cell proliferation. scRNA analysis identified 12 clusters of epithelial cells, and malignant tumor cells developed from Low CSCs to High CSCs. hdWGCNA indicated that blue modules are significantly enriched in cluster1 and Cluster2, and there are differentiation trajectories at the end. Cellcall analyzed the pathway differences between High CSCs malignant and low CSCs malignant and other cell subpopulations, the results showed that cellular senescence and Hippo signaling pathway were the major difference pathways. Dysregulation of the Hippo signaling pathway is associated with cancer progression, including aberrant expression and activity of YAPs and TAZs, and deficiencies in large tumor suppressor kinase 1/2 (LATS1/2)^[128]19–[129]21. The role of YAP/TAZ in cancer stem cells and tumour recurrence is supported by recent evidence^[130]22,[131]23. In addition, Hippo signaling pathway is mainly concentrated in endothelial cells, which may be closely related to angiogenic mimicry^[132]24,[133]25. A CSC-like phenotype can be acquired by epithelial-mesenchymal transition (EMT) programs or by escaping from senescence^[134]26. These results suggest that Cellular senescence and Hippo signaling pathway may be involved in the deterioration of epithelial cells. A prognostic model was constructed based on machine learning and six key genes (SUB1, POLD2, ELOVL6, TNNT1, PPIA, IRX2) were screened. Several studies have shown that POLD2 is aberrantly expressed in multiple cancers, including ovarian carcinoma^[135]27 and glioblastoma^[136]28. Accumulating evidence has demonstrated that ELOVL6 is high-expressed and serves as a negative clinical predictor in a plenty of carcinomas^[137]29,[138]30. TNNT1 has been reported to contribute to the progression of colorectal cancer^[139]31 and breast cancer^[140]32, colon adenocarcinoma^[141]33. PPIA has been implicated in a broad range of pathological processes, including inflammatory diseases, aging and the progression of cancer metastasis^[142]34. Previous studies have demonstrated that overexpression of PPIA plays key roles in different types of cancer, including hepatocellular carcinoma, lung cancer, pancreatic cancer, breast cancer, colorectal cancer, squamous cell carcinoma and melanoma^[143]35. This study inevitably has some limitations. Firstly, our research data came from a public database, not our own. Although the validation set is sufficient to support the conclusions of our study, further validation of the prognostic and therapeutic effects of this model from our own center using a large sample size is needed in the future. Then, further functional experiments will be required to elucidate the biological mechanisms of these genes in lung adenocarcinoma stemness and TME landscape, and to determine whether they could be targeted to improve the effectiveness of immunotherapies and chemotherapies, Thirdly, the stem cell dataset (PCBC dataset) of prostate cancer was applied to lung adenocarcinoma, which does not have pluripotent stem cell data sets, is worthy of further study and exploration of its appropriateness and universality. In summary, analysis of both scRNA-seq and bulk RNA-seq in LUAD samples showed the CSC characteristics of the cancer transformation process from epithelial cell. Based on differentially correlated CSC-related genes, we constructed prognostic and immune-related models. We suggested that our stemness model has future clinical implications for prognostic evaluation and may help clinicians to select likely responders for prioritised use of current immune checkpoint inhibitors. Methods Data acquisition and processing The [144]GSE123902 single-cell dataset was downloaded from the Gene Expression Omnibus (GEO) database ([145]https://www.ncbi.nlm.nih.gov/geo), and samples were obtained from 8 patients with primary lung adenocarcinoma, 3 patients with brain metastases, 1 patient with bone metastases, 1 patient with adrenal metastases, and 4 patients with normal lung tissue, a total of 41,384 cells were obtained. PercentageFeatureSet function R package Seurat ([146]https://satijalab.org/seurat/) is used to calculate the percentage of mitochondria, ribosomes and erythrocytes. Cells were selected with more than 300 expressed genes, less than 15% mitochondrial gene expression and less than 1% erythrocyte gene proportion. Then, the combined scRNA-seq data was normalized, and the Top 2000 highly variable genes were found by FindVariableFeature function R package Seurat, and the ScaleData function R package Seurat was used to scale all genes, and the RunPCA function was used to reduce the dimensionality of the Top 2000 highly variable genes selected. Batch correction is then performed using the harmony algorithm. The “FindNeighbors” and “FindCluster” functions (resolution = 0.8) R package Seurat are used to cluster cells when dim = 20. Next, we use the RunUMAP method for further dimensionality reduction. Finally, we screened the marker genes (Table [147]1) of subpopulation using the FindAllMarkers function, annotated and visualized them using references^[148]36 and cellmarker2.0^[149]37. Tumor cell identification