Abstract

Background

   Pancreatic cellular heterogeneity is fundamental to systemic metabolic
   regulation, yet its pathological remodeling in diabetes remains poorly
   characterized.

Methods

   We integrated single-cell RNA sequencing with machine learning
   frameworks to decode pancreatic heterogeneity. Novel tools included
   PanSubPred (two-stage feature selection/XGBoost classifier) for
   multi-lineage annotation and PSC-Stat (XGBoost/Gini optimization) for
   stellate cell activation analysis.

Results

   By establishing PanSubPred, we systematically decoded pancreatic
   cellular diversity, identifying 64 cell-type-specific markers (38
   novel) that maintained cross-dataset accuracy (AUC > 0.970) even after
   excluding known canonical markers. Building on this annotation
   precision, we developed PSC-Stat to quantify stellate cell activation
   dynamics, revealing their progressive activation from diabetes to
   pancreatic cancer (activated/quiescent ratio: control: 1.44 ± 1.02,
   diabetes: 4.72 ± 4.01, pancreatic cancer: 18.67 ± 18.70). Diabetes
   reorganized intercellular communication into ductal-centric hubs via
   FGF7-FGFR2/3, EFNB3-EPHB2/4/6 and EFNA5-EPHA2 axes, from which we
   derived a 15-gene signature for diabetic ductal cells (AUC = 0.846).
   Beta cell heterogeneity analysis uncovered diabetes-associated
   depletion of mature insulin-secretory clusters (INS + NKX6-1+),
   expansion of immature (CD81 + RBP4+) and endoplasmic reticulum
   stress-adapted subtypes (DDIT3 + HSPA5+). Moreover, non-beta lineages
   exhibited parallel dysfunction: acinar cells shifted toward
   inflammatory states (CCL2 + CXCL17+), while ductal cells adopted
   secretory phenotypes (MUC1 + CFTR+).

Conclusions

   This study presents a machine learning-based single-cell framework that
   systematically maps pancreatic cellular alterations in diabetes. The
   identified novel signatures, stellate activation dynamics, and beta
   cell maturation trajectories may serve as potential targets for
   diabetic management and pancreatic cancer risk stratification.

Graphical abstract

   [40]graphic file with name 12933_2025_2865_Figa_HTML.jpg

Supplementary Information

   The online version contains supplementary material available at
   10.1186/s12933-025-02865-8.

   Keywords: Pancreatic cellular heterogeneity, Single-cell
   transcriptomics, Machine learning, Type 2 diabetes, Stellate cell
   activation, Beta cell dysfunction

Research in context

   What is already known about this subject?
     * Pancreatic cellular heterogeneity underpins systemic metabolic
       regulation and orchestrates glucose homeostasis.
     * The communication network among pancreatic cells is crucial for
       maintaining normal function.
     * Beta cell mass declines in T2D, with beta cells displaying
       heterogeneity between mature and immature subpopulations.

   What is the key question?
     * How does pancreatic molecular heterogeneity, spanning multi-lineage
       cellular states, intercellular crosstalk, and β cell maturation
       gradients, orchestrate metabolic homeostasis, and how do these
       systems collapse in T2D?

   What are the new findings?
     * AI-empowered cellular cartography: We developed PanSubPred and
       PSC-Stat, frameworks enabling high-fidelity annotation of
       pancreatic lineages (64 markers, 38 novel) and stellate cell
       activation.
     * Diabetes-driven communication breakdown: T2D collapses global
       intercellular signaling but strengthens ductal-centric hubs via
       certain pathways, forming a 15-gene novel signature.
     * Multi-lineage pancreatic dysfunction: T2D drives divergent
       transitions across β and non-β cells. Decline in mature β cell
       populations is observed in T2D, coupled with an increase in
       immature β cells.

   How might this impact on clinical practice in the foreseeable future?
     * The AI-driven PanSubPred/PSC-Stat platform accelerates single-cell
       multi-lineage exploration, potentially guiding the development of
       new therapeutic approaches.

Introduction

   The human pancreas, a central hub for systemic metabolic regulation,
   maintains nutrient homeostasis through its dual exocrine and endocrine
   functions [[41]1]. The exocrine pancreas secretes digestive enzymes
   including proteases, lipases, and amylases, to break down meals into
   absorbable small molecules like amino acids, fatty acids, and
   monosaccharides for intestinal absorption [[42]2, [43]3]. The endocrine
   function, orchestrated by dispersed islet cells, dynamically regulates
   nutrient storage and utilization through hormones like insulin and
   glucagon. Insulin drives the conversion of glucose into glycogen or fat
   for storage [[44]4], whereas glucagon not only participates in glucose
   production, but also modulates circulating fatty acids and amino acids
   [[45]5], and it has been shown to play an important role in the
   regulation of hepatic steatosis, which is closely linked to insulin
   resistance in patients with type 2 diabetes (T2D) [[46]6–[47]10].

   Pancreatic function critically depends on a highly heterogeneous
   cellular network. Recent studies reveal that classically defined cell
   types in pancreas exhibit significant subpopulation divergence. For
   instance, beta cells can be subdivided into functionally distinct
   subpopulations based on insulin secretion capacity, functional
   identity, or cell maturity [[48]11–[49]13]. Molecular profiling with
   multi-omics approaches, including single-cell/single-nucleus
   transcriptomics or single-cell chromatin accessibility, has provided
   deep insights into beta cell heterogeneity [[50]14–[51]17].
   Functionally, this heterogeneity enables adaptive responses to
   different metabolic demands: high-secreting beta cell clusters maintain
   insulin level, while stress-resilient subtypes may facilitate recovery
   and regeneration during periods of metabolic stress [[52]18].
   Segerstolpe et al. identified a highly proliferative alpha cell subset
   potentially involved in pancreatic self-renewal mechanisms [[53]19].
   Pancreatic stellate cells (PSCs), a special stromal cell, exhibit two
   phenotypes: quiescent (qPSC) and activated (aPSC) [[54]20]. Quiescent
   PSCs, characterized by vitamin A-rich lipid droplets, locate at the
   peri-acini or around vascular cells, while aPSCs lose these droplets
   and produce substantial extracellular matrix proteins, playing a
   pivotal role in remodeling the tumor microenvironment to support
   pancreatic ductal adenocarcinoma (PDAC) [[55]21, [56]22]. Additionally,
   pancreatic acinar and ductal cells also display molecularly distinct
   subpopulations [[57]19, [58]21]. Such cellular heterogeneity
   complicates the molecular studies of human pancreases.

   Single-cell RNA sequencing (scRNA-seq) analysis enables transcriptomic
   profiling of individual cells within heterogeneous tissues
   [[59]23–[60]25], offering deep insights into pancreatic cellular
   diversity. While existing studies have mapped cell-type-specific
   molecular profiles in both non-diabetic (ND) and diabetic individuals
   [[61]21, [62]26, [63]27], discrepancies in marker genes, especially in
   the exocrine cells of the pancreas, hinder cross-dataset comparisons
   and in-depth functional analyses of subpopulations. Machine learning
   algorithms now provide a breakthrough: supervised probabilistic
   prediction frameworks integrate multi-source data to build robust
   classification models, facilitating cell annotation and discovery of
   novel cell markers [[64]28, [65]29]. Nevertheless, although pancreatic
   cellular heterogeneity (e.g., distinct subpopulations [[66]19, [67]21])
   is well-documented, its functional contribution to T2D pathogenesis
   remains poorly characterized.

   In this study, we integrate high-resolution scRNA-seq data with machine
   learning to systematically map the heterogeneity of human pancreatic
   exocrine and endocrine cells. We developed PanSubPred, a novel
   transcriptomics-based pancreatic cell classifier, which enabled
   high-precision multi-lineage annotation and identified 64
   cell-type-specific markers, including 38 novel ones. Leveraging this
   accurate annotation, we quantified stellate cell activation dynamics
   using PSC-Stat, revealing a progressive activation continuum from
   diabetes to PDAC. Intercellular communication analysis uncovered
   diabetes-driven reorganization into ductal-centric signaling hubs and a
   novel 15-gene signature. Furthermore, we investigated dynamic changes
   in subpopulation heterogeneity across pancreatic cell types, including
   beta and non-beta cells, collectively linking multi-lineage
   dysregulation to T2D progression.

Methods

Data curation

   We retrieved the publicly available human pancreatic datasets from the
   National Center for Biotechnology Information (NCBI) Gene Expression
   Omnibus (GEO) and ArrayExpress (Table [68]S1) [[69]16, [70]19, [71]21,
   [72]30–[73]34]. Donors previously diagnosed with T2D or with an
   HbA1c ≥ 6.5 were classified as T2D, while those without a previous T2D
   diagnosis and with an HbA1c ≤ 5.6 were classified as ND [[74]35]. To
   construct PanSubPred, the dataset from Baron et al. (containing 6749
   cells) was randomly partitioned into a training set (4723 cells, 70%)
   and an independent internal testing set (2026 cells, 30%) at a 7:3
   ratio. Additionally, T2D cells from the Baron et al. dataset were also
   used as an internal test set, while datasets from Segerstolpe et al.,
   Enge et al., and Ngara et al. served as external validation cohorts
   (Table [75]S2). The PSC phenotype classifier (PSC-Stat) was developed
   with 400 pancreatic stellate cells (168 qPSCs and 232 aPSCs) from Baron
   et al., split into a training set (280 cells, 70%) and a internal
   testing set (120 cells, 30%). Diabetic-status cells from Baron et al.
   were also used as an internal test set, while datasets from Fang et
   al., Kang et al., and PDAC data from Steele et al. served as external
   validation cohorts (Table [76]S7). Ductal cell profiles extracted from
   Fang et al. were randomly divided into training (80%) and independent
   testing (20%) sets (Table [77]S9).

Raw data processing and quality control

   For the raw reads from ArrayExpress (accession number: E-MTAB-5061) and
   GEO (accession number: [78]GSE81608), sequence quality was assessed
   using FastQC (v.0.12.1) and MultiQC (v.1.22.2). Fastp (v.0.23.4) was
   then employed to remove adapters and low-quality sequences. After
   quality control, cells with less than 85% of bases exceeding a Q30
   quality score were excluded. Specifically, 196 low-quality cells were
   filtered out from E-MTAB-5061 and 13 cells were removed from
   [79]GSE81608. Then, purity-filtered reads were aligned to the human
   GRCh38 reference genome using STAR (v.2.7.11a) with default parameters,
   and the number of read counts per gene was quantified with HTSeq
   (v.2.0.2). Aligned and preprocessed expression count table data from
   [80]GSE84133, [81]GSE81547, [82]GSE153855, [83]GSE101207,
   [84]GSE217837, and [85]GSE155698 were downloaded directly from NCBI
   GEO.

   The Seurat (v4.1.1) [[86]36] R package was used for data preprocessing,
   including filtering low-quality cells, normalization, scaling, and
   visualization of marker genes. The Seurat object was filtered to
   include only genes expressed in three or more cells and exclude
   low-quality cells (cells with genes < 200 and > 5,000, mitochondrial
   gene percentage > 25%). Gene expression was normalized for each cell by
   library size and log-transformed using a size factor of 10,000
   molecules per cell. Dimensionality reduction was performed by principal
   components analysis (PCA), and visualized with Uniform Manifold
   Approximation and Projection (UMAP). Cell clusters were identified
   using FindClusters function (based on the top 15 principal components),
   with the Louvain algorithm [[87]37] implemented in Seurat.

Feature selection and model construction for machine learning analysis

   In this study, to select features associated with the target variables,
   several commonly used feature selection algorithms were applied to
   calculate the importance scores and rankings of all features, including
   maximal information coefficient (MIC), analysis of variance (ANOVA),
   F-score, principal component analysis (PCA), gradient boosting decision
   tree (GBDT), and gini impurity [[88]38, [89]39]. Features were ranked
   according to their importance scores, and then the incremental feature
   selection (IFS) strategy was employed to identify the local optimal
   feature subset containing the top k features.

   In addition, several machine learning classification algorithms, namely
   support vector machine (SVM), random forest (RF), eXtreme Gradient
   Boosting (XGBoost), and logistic regression (LR) were included in this
   study. To evaluate the baseline performance of candidate algorithms,
   all classifiers were implemented using their default parameters. After
   obtaining the optimal model, hyperparameter tuning was implemented
   using the GridSearchCV function with 5-fold cross-validation (CV) to
   optimize the parameters for each classifier. For the multi-class
   classification problem of the pancreas cell subpopulation classifier,
   the one-vs-rest strategy was employed to the binary classification
   algorithm.

Reproducing existing cell-type prediction pipelines

   Four established cell-type annotation pipelines were reproduced for
   benchmarking against PanSubPred: caSTLe [[90]40], scID [[91]41], scPred
   [[92]42], and CellTypist [[93]43]. All pipelines were trained on the
   Baron dataset using the same training strategy as PanSubPred, including
   performance evaluation across two internal test cohorts: Baron test set
   and Baron T2D data, and three external test data: Segerstolpe external
   data, Enge external data, and Ngara external data. The caSTLe [[94]40]
   implementation followed its original GitHub repository
   ([95]https://github.com/yuvallb/CaSTLe), with Baron data as the source
   domain and all test cohorts as target domain. scID [[96]41] and scPred
   [[97]42] were reproduced using code from their respective repositories
   ([98]https://github.com/BatadaLab/scID and
   [99]https://github.com/powellgenomicslab/scPred), maintaining
   equivalent training-testing splits and preprocessing steps as
   PanSubPred. We incorporated two distinct aspects of using CellTypist
   for automated cell type annotation and benchmarking. First, we utilized
   the existing pre-trained model provided by the developers, which was
   trained on healthy adult human pancreatic islet cell data. For this, we
   employed the “Adult_Human_PancreaticIslet” model through the CellTypist
   online platform ([100]https://www.celltypist.org/) by sequentially
   uploading the test datasets in CSV format and selecting this
   pre-trained logistic regression–based model for annotation. Second, to
   ensure a fair comparison at the method level, we retrained CellTypist
   using our unified reference training dataset, thereby generating a new
   model based on the same logistic regression framework, which was
   implemented using the celltypist.train function in the celltypist
   Python package (v1.6.3). This retrained model was then applied to
   various internal and external test datasets to evaluate its performance
   relative to other methods under the same training conditions. For the
   other comparative methods (caSTLe, scID, and scPred), we similarly
   initialized and trained each tool using our reference training dataset
   according to their respective frameworks. This consistent training
   approach ensured that all methods learned from the same reference data,
   allowing for a fair and unbiased benchmarking of their cell type
   annotation performance on the test datasets.

Performance evaluation of machine learning model

   Serval classic metrics, including accuracy, precision, recall, and F1
   measure were used to evaluate the model performance [[101]44], which
   were defined as follows.
   graphic file with name d33e578.gif 1

   where TP, FP, TN, and FN denote the number of true positives, false
   positives, true negatives and false negatives, respectively.

   Additionally, the area under the receiver operating characteristic
   (ROC) curve (AUC) was also utilized to quantify the prediction
   performance of the model.

Pathway enrichment analysis and pathway scoring

   Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes
   and Genomes (KEGG) pathway analysis were performed using the
   clusterProfiler [[102]45] (v4.7.1.3) R package. GO enrichment analyses
   were conducted with the enrichGO function, and KEGG pathway enrichment
   was carried out using the enrichKEGG function, both with default
   settings. Enriched terms were ordered by statistical significance, and
   terms with a p-value < 0.05 was considered statistically significant.
   Pathway activity quantification at single-cell resolution was achieved
   through implementation of the Seurat AddModuleScore function, which
   calculates cell scores based on specific gene sets. This approach
   enabled us to quantify the activity of particular pathways or gene sets
   at the single-cell level, offering deeper insights into the functional
   significance of differentially expressed genes.

Pancreas intercellular communication analysis

   Pancreatic cell-cell communication networks were systematically
   analyzed using the CellCall R package (v1.0.7) [[103]46], the CellChat
   R package (v2.2.0) [[104]47], and the cellphonedb python package
   (v5.0.1) [[105]48]. The CellCall analysis included inferring
   intercellular communication scores, exploring pathway activity, and
   predicting the activity scores of downstream transcription factors
   (TFs). Two main metrics were utilized to quantify intercellular
   communication: (1) communication frequency: the total number of binary
   presence/absence of interaction between any two of the seven pancreatic
   cell types (e.g., beta-alpha and delta-alpha in ND) and (2)
   communication strength, defined as communication scores calculated for
   each cell-cell pair using the computational framework from [[106]46].

Pancreas cell subpopulation analysis

   Three different data were used to explored pancreatic heterogeneity.
   Initially, a separate Seurat object was created for each dataset, and
   low-quality cells were removed from each dataset individually.
   Subsequently, the three datasets were further normalized, merged, and
   batch-corrected using the merge functions from the Seurat (v4.1.1)
   package and RunHarmony functions from the Harmony (v.0.1.0) package
   [[107]49]. Downstream clustering analysis was then performed using the
   harmony-corrected components. The FindNeighbors function of Seurat with
   the first 20 principal components was used for the nearest-neighbor
   graph construction. For each cluster of subset cells, the
   FindAllMarkers function of Seurat package was utilized to find the
   differentially expressed positive genes (DEPGs) (> 25% cell detection;
   p.adj < 0.05 and logFC > 1). The pathway analysis was performed to
   explore the underlying biological functions of these DEPGs from
   different pancreatic cell clusters.

Pseudotime trajectory and RNA velocity analysis

   Pseudotime analysis was performed using the Monocle R package (v2.26.0)
   [[108]50]. The workflow included converting Seurat objects to Monocle
   objects while preserving highly variable genes and cell-type annotation
   information. Trajectory inference was conducted using Monocle’s DDRTree
   method for dimensionality reduction and trajectory visualization,
   enabling the exploration of cell differentiation trajectories. To
   perform RNA velocity analysis, we retrieved cellular identifiers,
   dimensionally reduced embeddings (PCA/UMAP), and associated metadata
   from the primary Seurat object. Spliced/unspliced counts were generated
   using Velocyto (v0.17.16) [[109]51] through alignment against the
   GRCh38 human reference genome. These matrices were then integrated with
   annotated single-cell transcriptomes via scVelo (v0.3.3) [[110]52] for
   dynamical modeling. Employing a stochastic inference framework, we
   computed RNA velocity vectors and projected them onto the embedding to
   map potential lineage trajectories among cell subclusters.

Results

Development and validation of PanSubPred for accurate pancreatic cell subtype
identification

   To systematically identify novel and robust cell-type-specific marker
   genes across seven major pancreatic cell populations, including both
   endocrine and exocrine lineages, we developed PanSubPred, an integrated
   computational framework tailored for pancreatic cell subtype
   annotation. Details on the development and validation of PanSubPred are
   provided in the Supplementary Text [111]S1 (Fig. [112]1). Using cell
   labels predicted by PanSubPred, we successfully distinguished cell
   subpopulations with distinct clustering patterns (Fig. [113]1B). Among
   the 64 marker genes, most were exocrine-related (Fig. [114]1C and D).
   Notably, 26 genes were previously reported markers, while 38 are novel
   candidates (Table [115]S6).

Fig. 1.

   [116]Fig. 1
   [117]Open in a new tab

   Construction and characterization of PanSubPred. A Summary of the
   PanSubPred method. B Clustering of gene expression profiles of
   pancreatic islet cells, using cell labels predicted by PanSubPred. C
   The heatmap of 64 genes in seven pancreatic cell types reveals the
   cell-type-specific genes (left). The normalized expression level for
   CELA3A, GC and COL6A2 genes for each cell projected onto UMAP
   coordinates (right). D The dot plot of 64 genes in different pancreatic
   cell subpopulations reveals the cell-type-specific genes. E Pathway
   enrichment analysis of 64 pancreas marker genes. Pathways were ordered
   by statistical significance (p-value). The Black dashed line represents
   the statistical significance threshold (p-value < 0.05)

PSC-Stat decodes stellate cell activation dynamics across the T2D-PDAC
pathological spectrum

   PSCs are central to pancreatic fibro-inflammatory remodeling, where
   their transition from qPSCs to aPSCs drives disease progression in both
   diabetes and pancreatic cancer [[118]22, [119]53]. Precise
   discrimination of these functionally distinct phenotypes is critical
   for identifying therapeutic targets against pathological stromal
   reprogramming. Among the 64 pancreatic cell marker genes identified in
   PanSubPred, we found that 14 of the 17 stellate-cell-specific genes,
   including both known markers and newly identified candidates, were
   differentially expressed between qPSCs and aPSCs (Fig. [120]2A). This
   distinct expression pattern suggested that these genes could serve as
   potential biomarkers for distinguishing between the two PSC phenotypes.
   To test this hypothesis, we developed PSC-Stat, a dedicated
   computational framework for PSC phenotype classification
   (Fig. [121]2B). During model construction, we trained and compared
   three machine learning models (SVM, RF, and XGBoost), with the
   XGBoost-based model achieving the best performance (AUC = 0.956) in
   5-fold CV (Fig. [122]2C). To further optimize the model, we applied
   three feature selection algorithms, MIC, ANOVA, and Gini, to evaluate
   the 17 stellate-cell-specific genes. The Gini method identified the top
   15 genes as the optimal feature set based on their importance scores
   (Fig. [123]2D and E). Notably, MMP14 ranked as the top-priority gene in
   feature selection, followed by TIMP1 and CRLF1, suggesting their
   pivotal roles in pancreatic stellate cell activation dynamics
   (Fig. [124]2E). This refined gene panel not only improved model
   performance, but also highlighted molecular features that may play a
   role in the transition from qPSCs to aPSCs, offering biologically
   meaningful clues into the activation process.

Fig. 2.

   [125]Fig. 2
   [126]Open in a new tab

   Construction and evaluation of PSC-Stat. A The violin plot of 17
   stellate-cell-specific genes showed that 14 genes of them were
   differentially expressed between qPSCs and aPSCs. The statistical
   significance was as such: ns: p-value > 0.05; *: p-value < 0.05; **:
   p-value < 0.01; ***: p-value < 0.001; ****: p-value < 0.0001, two-sided
   Wilcoxon rank-sum test. B Summary of the PSC-Stat framework. C The
   performance comparison of three classifier algorithms-based models with
   all 17 genes on 5-fold CV of training set. D The IFS curve of the 17
   genes from analysis of Gini, ANOVA, and MIC based on the XGBoost
   algorithm with 5-fold CV. E The importance scores of the top 15 genes
   based on gini impurity. F The ROC curve of PSC-Stat for different
   datasets. G Fang scRNA‑seq data projected on the Azimuth reference
   (top) and the corresponding cell type annotation prediction score
   displayed on dimensional reduction plot (bottom). H aPSC/qPSC ratio
   difference between ND (n = 9) and T2D (n = 4) individuals. two-sided
   Student’s t-test. I aPSC/qPSC ratio difference among ND (n = 9), T2D
   (n = 4) and PDAC (n = 14) individuals. two-sided Student’s t-test

   The optimized PSC-Stat achieved outstanding performance, with an AUC of
   0.963 in 5-fold CV and 0.978 on the independent test set
   (Fig. [127]2F), demonstrating its robustness and reliability. Notably,
   it accurately distinguished qPSCs/aPSCs in the T2D status with an AUC
   of 0.992 (Fig. [128]2F). To further assess the generalizability of
   PSC-stat, we validated it on three external independent cohorts: Fang
   (healthy and T2D individuals) [[129]32], Kang (healthy donors)
   [[130]16], and Steele (PDAC and adjacent tissues) [[131]33] (Fig.
   [132]S5A–C). Since these datasets lacked explicit PSC activation
   labels, we projected the scRNA-seq data onto a publicly available
   Azimuth integrated human pancreas reference to infer stellate cell
   phenotypes and obtain corresponding prediction scores (Fig. [133]2G,
   [134]S5D and E) [[135]16]. Cells with high-confidence (prediction
   score > 0.9) were retained for model evaluation. PSC-stat achieved
   consistent performance across all cohorts (Fig. [136]2F: AUC:
   Fang = 0.886, Kang = 0.835, Steele = 0.881), underscoring its robust
   predictive accuracy across diverse disease contexts. Together, these
   results validate PSC-stat as a powerful tool for precise classification
   of PSC state, providing a valuable resource for studying stellate cell
   dynamics in pancreatic physiology and pathology. PSC-Stat is freely
   available at [137]http://lin-group.cn/server/PSC-Stat/index.html.

   To investigate the pathological relationship between T2D and PDAC, we
   analyzed stellate cell activation dynamics across disease states.
   Strikingly, the aPSC/qPSC ratio exhibited a stepwise increase from
   healthy controls (ND) to T2D and further to PDAC (Fig. [138]2H-I, S5F,
   aPSC/qPSC ratio: ND: 1.44 ± 1.02, T2D: 4.72 ± 4.01, PDAC:
   18.67 ± 18.70). This cascade activation pattern aligns with clinical
   observations: epidemiological studies show that new-onset diabetes
   (NOD) patients have a 5- to 8-fold increased risk of being diagnosed
   with PDAC within 1 to 3 years [[139]54, [140]55], and approximately 50%
   of PDAC cases present with diabetic phenotypes at diagnosis [[141]56].
   Our findings provide single-cell evidence that progressive stellate
   cell activation may serve as a pathological nexus linking T2D to PDAC
   development. These results establish a cellular dynamics framework for
   understanding the T2D-PDAC continuum and suggest that monitoring aPSC
   ratios could serve as a novel biomarker for early PDAC detection in
   diabetic populations.

Rewired intercellular communication in T2D highlights ductal-centric
interaction hubs and a 15-gene diagnostic signature

   Recent studies have highlighted that coordinated intercellular
   communication is essential for maintaining pancreatic function and
   glucose homeostasis [[142]57]. To investigate how diabetes perturbs
   this coordination, we systematically compared cell-cell interaction
   networks between ND and T2D samples using CellCall [[143]46], a
   computational framework that integrates ligand-receptor (L-R)
   interactions and downstream TF activities through KEGG pathway-based
   L-R-TF axes. This method infers potential communication events by
   calculating the L2-norm of L-R co-expression (normalized via softmax
   function) combined with TF activity scores derived from regulon-target
   gene enrichment analysis.

   In normal pancreatic tissue, intercellular communication networks
   primarily centered on exocrine cells, with stellate and ductal cells
   acting as central signal integrators that coordinate signals from both
   endocrine and exocrine populations (Fig. [144]3A). In contrast, T2D
   patients exhibited significant remodeling of these networks: global
   interaction frequency (i.e., the number of cell-cell interactions)
   declined (e.g., complete loss of stellate (sender)-stellate (receiver),
   delta-alpha, and gamma-stellate interactions etc.), and signal strength
   (i.e., the communication score calculated by CellCall for each
   cell-cell interaction) between most cell pairs decreased, particularly
   the communication targeting stellate cells. This shifted the
   communication hub toward ductal cells, forming a new pattern where
   ductal cells become the primary signal recipients (Fig. [145]3A). To
   further validate this finding, we conducted comparative analyses using
   two additional cell-cell communication inference tools, CellChat and
   CellphoneDB. CellChat analysis confirmed that, compared to the ND
   group, both the number and the overall strength of inferred
   interactions were reduced in T2D (Fig. [146]S6A). Examining interaction
   numbers and strengths consistently showed that in the ND group, the
   majority of cell-cell interactions were organized around stellate and
   ductal cells serving as core signal integrators (Fig. [147]S6B, D). In
   contrast, in T2D samples, the communication network shifted to a ductal
   cell-centered pattern (Fig. [148]S6C, E). We further used CellChat’s
   comparative pipeline to visualize the differences in interaction
   numbers and strengths between ND and T2D using circle diagrams. The
   results once again supported the observation that in T2D, interactions
   directed toward ductal cells were notably enhanced, particularly
   signals from beta cells to ductal cells and from delta to ductal cells
   (Fig. [149]S6F and G). Additionally, we applied CellphoneDB as a second
   independent method to identify statistically significant
   ligand–receptor interactions in both ND and T2D groups, considering
   interactions with p-values < 0.05 as significant. Consistent with the
   previous analyses, we found that the number of significant interactions
   in T2D was lower than in ND. Moreover, in the ND group, significant
   interactions were predominantly concentrated around stellate and ductal
   cells, whereas in T2D they were mainly distributed in stellate and
   ductal cells, with ductal cells emerging as a key hub (Fig.
   [150]S6H-I).

Fig. 3.

   [151]Fig. 3
   [152]Open in a new tab

   Analysis of cell-cell communication among pancreatic cells in ND and
   T2D. A Circle plot of intercellular communication patterns among seven
   pancreatic cell subsets in ND (left) and T2D (right), respectively. The
   outermost ring colors represent the seven distinct cell types. The
   inner ring colors indicate whether the cell is acting as a sender
   (ligand-expressing, in red) or a receiver (receptor-expressing, in
   blue). Each arc represents an inferred communication from a
   ligand-expressing cell to a receptor-expressing cell, representing the
   inferred signal directionality. The color of the arc corresponds to the
   sender cell type, and the color gradient from light to dark reflects
   the strength of the communication score between the interacting cell
   pairs. B Sankey plot showing the ligand–receptor–transcription factor
   (L–R–TF) signaling cascade from beta cells (sender) to ductal cells
   (receiver) under T2D conditions, in which ligands are expressed in beta
   cells and both receptors and TFs are in ductal cells. The three columns
   from left to right represent ligands (sender), receptors (receiver),
   and TFs, respectively. And the coloring of the genes is only intended
   to improve readability. Each complete stream indicates a distinct
   signaling pathway involving a L–R–TF signaling. The color of the stream
   on the left and right sides are consistent with ligand and receptor
   respectively, allowing the flow of information through the pathway to
   be visually tracked. C The communication network from other cells to
   ductal cells in T2D patients. Select ligand-receptor axes with a
   communication score greater than 0.5 for visualization. D Venn diagram
   showing the overlap of target genes associated with 6 key TFs in ductal
   cells (TGs) and two disease-specific pathways (PGs). E Venn diagram
   showing the shared genes in ductal cells between the Fang et al.
   (discovery) and Segerstolpe et al. (validation) datasets. F The
   performance comparison of four classifier algorithms-based models with
   the shared 77 genes on training set. G The IFS curve of gene selection
   using LR and four feature selection methods on training set. The black
   dotted line represents the top 15 genes selected by GBDT. H The
   importance scores of the top 15 genes selected by GBDT. I The ROC curve
   of the model with the top 15 genes on training, internal testing and
   external validating Segerstolpe dataset

   Interestingly, despite the global attenuation of intercellular
   signaling, beta-ductal and delta-ductal communication scores were
   enhanced in T2D (Fig. [153]3A), potentially driven by ligand-receptor
   axes such as FGF7-FGFR2/3, EFNB3-EPHB2/4/6, and EFNA5-EPHA2
   (Fig. [154]3B and C, Fig. [155]S7A–C). Further analysis of these
   ligand-receptor-TF axes in T2D ductal cells identified six key TFs:
   ETS1, JUN, MYC, NFKB1, NFKBIA, and ABL1 (Fig. [156]3B, Fig.
   [157]S7D–F). Additionally, CellCall pathway analysis revealed
   T2D-specific activation of insulin signaling and axon guidance
   pathways, predominantly enriched in beta-ductal and delta-ductal
   interactions (Fig. [158]S7G-H), suggesting their roles in metabolic
   stress adaptation and neural remodeling during diabetes progression.

   To test whether these transcriptional and pathway alterations could
   serve as potential T2D-specific signatures, we developed a machine
   learning framework integrating TF-associated genes and pathway
   components. From the 1,472 target genes (TGs) of the six key TFs in
   ductal cells (Table [159]S8A) and 304 genes from T2D-enriched pathways
   (PGs) (Table [160]S8B), we identified 1,729 candidate genes, including
   47 shared genes (Fig. [161]3D). We then extracted their expression
   profiles from two independent ductal cell datasets, covering both ND
   and T2D samples. Due to the inherent sparsity in single-cell data, many
   genes are not detected in most cells; therefore, we excluded genes not
   expressed in more than 70% of cells. After filtering, 78 and 1,451
   genes remained in the two datasets, with 77 genes shared
   (Fig. [162]3E). The downstream analyses focused on this robust set of
   77 genes. The larger dataset was used as a discovery cohort, randomly
   split into training (80%) and testing (20%) sets (Table [163]S9). We
   evaluated multiple algorithms and found that LR-based model achieved
   the best performance (Fig. [164]3F). Using MIC, ANOVA, Gini, and GBDT
   for feature selection combined with LR, we selected the inflection
   point with top 15 GBDT-ranked genes as the optimal feature set
   (Fig. [165]3G-H). Parameter tuning based on these 15 genes yielded a
   model with good discrimination in both the training and internal test
   sets (Fig. [166]3I). External validation on the Segerstolpe dataset
   further confirmed its robustness (AUC = 0.846; Fig. [167]3I). These
   results indicate that integrating communication-related transcriptional
   hubs and pathway perturbations is a promising strategy for identifying
   T2D biomarkers mechanistically linked to disease progression. The
   consistent high AUC values across datasets support the notion that
   altered intercellular communication is a core feature of T2D and offers
   important insights into its pathogenesis.

   Among the 15 identified genes, the core gene RPL3 (Fig. [168]3H, ranked
   first in feature importance) is a downstream target of the
   transcription factor MYC. In addition, RPS4X, RPL13, RPL36, and ACTG1,
   part of this 15-gene panel, are also regulated by MYC. Studies have
   shown that MYC, a known recovery gene associated with apoptosis, also
   acts as a key regulator of cell proliferation, and its altered
   expression levels are linked to the development of T2D [[169]58].
   Furthermore, four of the 15 genes are downstream targets of ABL1, and
   another four are regulated by NFKB1. Notably, seven out of the 15
   genes, including ACTB (ranked third), S100A6 (ranked fourth), and KRT19
   (ranked fifth), are downstream targets of the transcription factor ETS1
   (Fig. [170]3H, Table [171]S8). This suggests that ETS1 may serve as a
   central hub in regulating the transcriptional network of pancreatic
   ductal cells, further supporting our hypothesis that
   communication-regulated transcription factors and their target genes
   can provide new molecular evidence and potential intervention targets
   for T2D progression. Collectively, these genes, orchestrated by TFs
   such as MYC, ETS1, and NFKB1, form a ductal cell-centered communication
   network, offering a new set of potential molecular biomarkers for
   diabetes.

A representative mature beta cell subset associated with insulin secretion
and mitochondrial metabolism is reduced in T2D

   Beta cell heterogeneity has been well-documented in both humans and
   mice, with distinct subpopulations identified based on functional,
   morphological, molecular, and maturation characteristics
   [[172]11–[173]13]. However, the relationship between beta-cell
   heterogeneity and beta-cell failure in T2D, as well as the specific
   roles of individual beta subpopulations in diabetes pathogenesis,
   remains poorly understood. To address this, we integrated single-cell
   transcriptomic data from Xin, Baron, and Segerstolpe with ND and T2D
   individuals. After quality control to remove low-quality cells, we
   constructed a joint embedding space consisting of 18 distinct clusters,
   irrespective of data source, donor origin, and disease status (Fig.
   [174]S8A). Upon annotating the major pancreatic cell subpopulations
   (Fig. [175]S8B–D), we extracted beta cells and performed Louvain
   clustering, identifying three distinct clusters that were present in
   both ND and T2D (Fig. [176]4A-B).

Fig. 4.

   [177]Fig. 4
   [178]Open in a new tab

   scRNA-seq analysis reveals changes in beta cell heterogeneity promoted
   by T2D. A Unsupervised clustering of single-cell transcriptome
   visualized with UMAP analysis. Data represent pancreatic islet cells
   from ND (n = 6,284 cells pooled from 19 donors) or T2D (n = 3,083 cells
   pooled from 9 donors). B Projection of beta cells (n = 2,793) from both
   ND and T2D using UMAP analysis. C Normalized expression values for key
   beta cell maturation, immaturity, insulin secretion, endoplasmic
   reticulum (ER) stress and ER-associated degradation (ERAD) compared
   among different beta cell clusters. D Cell scores for selected gene
   sets, including beta cell maturity, immaturity, insulin secretion, beta
   cell development, beta cell proliferation and response to ER stress.
   The central error bars in the violin plot indicate the mean and
   standard deviation (mean ± sd) of the data distribution. E Enrichment
   of GO terms ordered by statistical significance in cluster 0 beta
   enriched genes. Dashed line represents p-value < 0.05. F Frequency of
   beta cluster cells in ND and T2D donors, with cluster 0 (ND n = 19, T2D
   n = 9), cluster 1 (ND n = 7, T2D n = 7), and cluster 2 (ND n = 13, T2D
   n = 5). Data shown are the mean ± s.e.m. *p value < 0.05, two-sided
   Wilcoxon rank-sum test. G Significant differential enriched genes in
   each beta cluster of ND and T2D individuals. H Summarized map of beta
   cell heterogeneity and loss. Tp: transcriptional. I RNA-velocity
   analysis of ND and T2D INS/GCG, alpha-/beta-cells. Black streamline
   arrows represent predicted direction of cell state change and
   trajectories. Larger red arrows represent overall velocity for each
   area of the UMAP

   We conducted molecular profiling of the three distinct beta cell
   clusters (Fig. [179]S8D-E), identifying three distinct beta cell
   subpopulations: cluster 0 representing mature beta cells, cluster 1
   corresponding to immature or dedifferentiated beta cells, and cluster 2
   characterized by elevated expression of endoplasmic reticulum (ER)
   stress-related genes. The detailed molecular signatures of each cluster
   are discussed in Supplementary Text [180]S2 (Fig. [181]4C-E).

   To obtain a global landscape of beta cell heterogeneity, we compared
   the proportion differences of beta cells between ND and T2D
   individuals. We identified a significant reduction in mature beta cells
   (cluster 0) and concomitant expansion of immature beta cells (cluster
   1) in T2D pancreases, accompanied by an increased trend of ER
   stress-associated beta cell subpopulations, although this did not reach
   statistical significance (Fig. [182]4F). These observations suggest
   that the progression of diabetes may be associated with a gradual loss
   of beta cell function and the accumulation of immature beta cells,
   reflecting the decline in insulin secretion capacity and changes in
   beta cell responsiveness. Consistent with our findings, Hrovatin et al.
   [[183]59] constructed a comprehensive single-cell atlas of mouse islets
   spanning developmental and diabetic states, and revealed that beta
   cells in T2D genetic models exhibit downregulation of genes associated
   with maturity and insulin secretion (e.g., Ucn3, G6pc2), alongside
   upregulation of ER stress activity and enrichment of immature-markers.
   This shift in gene expression programs aligns well with the
   heterogeneity we observed in the distribution of beta cell
   subpopulations in human T2D. Further analysis revealed that in T2D,
   subpopulations representing mature beta cells (cluster 0) and less
   mature cells (cluster 1) exhibited more differentially expressed genes
   compared to ND (Fig. [184]4G), suggesting that diabetes progression may
   amplify the molecular differences between these subpopulations,
   increasing transcriptional output and noise. Such heightened
   transcriptional output resembles the dysregulated gene expression
   observed in aging beta cells reported by Shrestha et al. [[185]60],
   potentially explaining the recurrent identification of ER stress-linked
   beta cell subtypes. Regarding the potential mechanisms of beta cell
   loss, we hypothesize that cells initially activate compensatory
   responses, such as ER-associated degradation (ERAD), to alleviate ER
   stress by degrading misfolded proteins. If stress is resolved, cells
   restore homeostasis and resume normal insulin secretion. However,
   persistent ER stress that cannot be mitigated may eventually trigger
   apoptotic pathways, leading to beta cell loss (Fig. [186]4H).

Molecular and functional heterogeneity of non-beta pancreatic cells in T2D

   Beyond beta cells, significant molecular heterogeneity exists across
   pancreatic cell populations. Sub-clustering of acinar cells identified
   two functionally distinct subsets (Fig. [187]5A): cluster 0 exhibited
   high expression of digestive enzyme-related genes (e.g., CELA2A,
   CELA3A, CELA2B), while cluster 1 was enriched for immune and
   inflammatory markers, such as CD74, CCL2, CXCL17 (Fig. [188]5B),
   consistent with prior reports by Segerstolpe et al. [[189]19]. Notably,
   beyond immune and inflammatory signatures, subset 1 acinar cells also
   displayed distinct metabolic dysregulation, characterized by oxidative
   stress, xenobiotic metabolism, and hormonal modulation (Fig. [190]5D).
   Functional characterization identified cluster 0 as the dominant mature
   acinar population (70.3% in healthy controls), primarily responsible
   for normal function of acinar cell. In contrast, cluster 1 adopted a
   multifunctional state involving inflammatory modulation, metabolic
   stress adaptation. Critically, the proportional expansion of cluster 1
   in T2D individuals suggests its potential role in exacerbating
   pancreatic dysfunction through aberrant metabolic-immune crosstalk
   (Fig. [191]5C).

Fig. 5.

   [192]Fig. 5
   [193]Open in a new tab

   Functional and molecular heterogeneity in non-beta pancreatic cells. A
   Projection of two acinar subpopulations from ND and T2D using UMAP
   analysis. B The differentially expressed genes between acinar clusters
   0 and 1 indicate enrichment patterns, with upregulated genes
   representing those enriched in cluster 0 and downregulated genes
   corresponding to those enriched in cluster 1. Significance threshold:
   adjusted p-value < 0.05 and|logFC| >1. C Bar plot showing the
   proportion of each acinar 0/1 subpopulation in ND and T2D. Each
   condition is color-coded as indicated. D Enrichment of GO terms and
   associated genes significantly enriched in acinar clusters 0 and 1. E
   Projection of two ductal subpopulations from ND and T2D using UMAP
   analysis. F Bar plot showing the proportion of each ductal 0/1
   subpopulation in ND and T2D. Each condition is color-coded as
   indicated. G Genes with significantly differential expression between
   subclusters of ductal subsets, with upregulated genes representing
   those enriched in cluster 0 and downregulated genes corresponding to
   those enriched in cluster 1. Significance threshold: adjusted
   p-value < 0.05 and|logFC| >1. H Pathway enrichment analysis of genes
   significantly enriched in ductal cell clusters 0 and 1. Dashed line
   represents p-value < 0.05. I Projection of three alpha subpopulations
   from ND and T2D using UMAP analysis. J Normalized expression values for
   key alpha cell identity/function, mitochondrial-associated, and cell
   proliferation-associated genes compared among different alpha cell
   clusters. K Comparison of GO enrichment analysis for genes enriched in
   alpha cell clusters 0 and 2. Dashed line represents p-value < 0.05

   Sub-clustering analysis of ductal cells also identified two distinct
   subpopulations (Fig. [194]5E). Cluster 0 predominated in ND pancreases
   (70.5%), whereas subset 1 was markedly expanded in T2D individuals
   (Fig. [195]5F: 70.4%). Transcriptomic profiling revealed ductal 0 cell
   was enriched for microtubule-associated genes (including TUBA1C,
   TUBA1B, TUBB, TUBB4B etc.), involved in cytoskeletal organization and
   mitotic cell cycle regulation, while cluster 1 was enriched for genes
   associated with bile, gastric, pancreatic, and salivary secretion
   pathways (Fig. [196]5G and H). Consistent with our findings, Baron et
   al. previously reported ductal cell heterogeneity, identifying two
   major subpopulations characterized by high expression of MUC1 and CFTR,
   respectively [[197]21]. In our study, both MUC1 and CFTR were
   significantly enriched in cluster 1 (Fig. [198]5G).

   We conducted a subpopulation analysis of pancreatic alpha cells and
   identified three distinct alpha cell subtypes (Fig. [199]5I). The
   predominant cluster 0 (79.4% of alpha cells) exhibited marked
   enrichment of alpha cell identity genes (Fig. [200]5J: ARX, LOXL4,
   FEV), aligning with the electrophysiologically active mature alpha cell
   subtype reported by Camunas-Soler et al. [[201]61]. In stark contrast,
   the minor cluster 2 (0.5%) showed pronounced cell cycle gene
   activation, including CDK1, MKI67, CENPF etc. Consistent with findings
   from Segerstolpe et al. [[202]19], this rare alpha-cell subset may
   represent a highly proliferative population contributing to pancreatic
   tissue self-renewal. Additionally, cluster 1 (20.1%) demonstrated
   unique enrichment of mitochondrial-encoded genes, suggesting metabolic
   reprogramming may underlie its functional specialization
   (Fig. [203]5J). Functional enrichment analysis confirmed that cluster 0
   was associated with biological functions related to mature alpha-cell
   activity, including hormone secretion/transport/response and signal
   release. Meanwhile, cluster 2 was predominantly enriched in pathways
   related to cell division, proliferation, and cell cycle regulation
   (Fig. [204]5K). Similar to the heterogeneity observed in beta-cell
   maturation states, we speculate that alpha cell subpopulations may
   maintain pancreatic homeostasis through a mature-proliferative duality,
   suggesting a conserved mechanism for pancreatic endocrine homeostasis.

Discussion

   In this study, we developed PanSubPred, a novel pancreatic cell
   annotation tool based on human pancreatic scRNA-seq data, for accurate
   identification of endocrine and exocrine cell subtypes. The innovation
   of this tool is twofold. First, our machine learning framework,
   integrating multi-source data, successfully addresses cross-dataset
   generalization challenges, establishing a highly robust classification
   model. Second, the integration of complementary feature selection
   algorithms with ensemble learning strategies effectively captures both
   linear and nonlinear gene expression patterns, enabling robust marker
   discovery. Performance evaluation demonstrates that PanSubPred
   maintains encouraging classification accuracy across all pancreatic
   cell types, notably showing robustness in cross-datasets. Systematic
   analysis of canonical markers versus novel candidates revealed an
   interesting biological insight: While the six established markers
   (INS/GCG/SST/PPY/PRSS1/KRT19) contribute to satisfactory classification
   accuracy, the remaining 58 genes maintain 98.3–100% discriminative
   power. This finding holds dual significance. Methodologically, it
   validates our framework’s ability to extract subtle biological signals
   beyond canonical markers. Biologically, it suggests pancreatic cell
   identity may be regulated through coordinated expression programs
   involving both well-established markers and novel genes, opening new
   avenues for discovering diagnostic biomarkers and therapeutic targets.
   The characteristic co-expression patterns among these genes also
   indicate their participation in cell-type-specific regulatory networks.
   Beyond the six classical markers (INS/GCG/SST/PPY/PRSS1/KRT19), the
   gene panel used in PanSubPred includes 20 previously reported potential
   markers and 38 novel cell-specific markers identified in this study
   (Table [205]S6); the latter may provide new molecular tools for
   enrichment, purification, and functional analysis of specific
   pancreatic cell subpopulations.

   Moreover, we also recognize that for diseases closely related to
   pancreatic function, such as diabetes, the choice of reference atlas
   (e.g., using healthy vs. diseased samples) during cell annotation may
   influence the accuracy of label transfer in single-cell analyses. Most
   existing reference atlases or pre-trained models are constructed based
   on healthy tissue samples; however, in disease states like diabetes,
   the transcriptional features of pancreatic cells may undergo
   significant changes, such as beta cell dysfunction and abnormal
   dedifferentiation. These alterations pose challenges for accurate label
   transfer. Therefore, PanSubPred places particular emphasis on robust
   annotation and generalization across multi-source data and different
   disease states, aiming to minimize potential biases caused by
   pathological conditions. In the future, developing mixed or stratified
   reference atlases that incorporate both healthy and different
   disease-state samples may further enhance the accuracy and stability of
   cell subtype annotation, providing more reliable tools to support
   single-cell studies of complex diseases.

   Significant heterogeneity has been identified within classically
   defined pancreatic cell populations. Specifically, PSCs exhibit two
   activation-dependent subtypes: activated (aPSC) and quiescent (qPSC).
   Leveraging PSC-specific markers, we developed PSC-Stat, a novel tool
   for dynamically resolving PSC activation states. Notably, our analysis
   revealed an upregulated aPSC/qPSC ratio in both T2D and PDAC compared
   to controls. In T2D, persistent aPSC activation drives excessive
   extracellular matrix deposition, accelerating pancreatic fibrosis,
   which may be an important mechanism of beta cell failure
   [[206]62–[207]64]. More importantly, the progressive activation of
   aPSCs from ND to T2D and T2D to PDAC highlights a potential
   pathological link between diabetes and PDAC. This aligns with clinical
   evidence of an elevated PDAC risk in NOD patients [[208]54, [209]55],
   further supporting the notion that diabetes serves as the early warning
   for PDAC. Mechanistically, T2D-induced aPSC activation may remodel the
   pancreatic stromal microenvironment through secretion of pro-fibrotic
   factors and pro-inflammatory mediators, ultimately promoting PDAC
   pathogenesis [[210]65]. These findings suggest that monitoring PSC
   activation states via PSC-Stat could facilitate the development of
   targeted interventions against aPSCs. Such strategies could not only
   mitigate pancreatic fibrosis but also improve the islet
   microenvironment, offering a dual regulatory approach for both T2D and
   PDAC progression.

   Beta cell heterogeneity is a typical representation of pancreatic islet
   cell heterogeneity. To reproduce previously reported beta cell
   molecular subpopulations, we employed a classical heterogeneity
   analysis framework: identifying subpopulation structures through
   k-nearest neighbors (kNN)-based clustering, screening molecular
   differences via Wilcoxon rank-sum test, and characterizing functional
   features/cellular identities through biological pathway enrichment
   analysis [[211]11, [212]21]. In integrated human islet single-cell
   data, we identified a more mature, metabolically and functionally
   active beta cell subpopulation, as reported by Rubio-Navarro et al. in
   mouse beta cell studies [[213]11], which accounted for 77.8% of beta
   cells in ND and was significantly reduced in T2D. In contrast, an
   immature/proliferative/dedifferentiated beta cell subpopulation
   exhibited expansion in T2D. This maturity gradient-based classification
   of beta cells aligns with Bader et al.‘s findings in mice, where the
   human ortholog CFAP126 of murine Fltp serves as a mature marker for
   beta cell [[214]12]. Crucially, the well-documented beta cell loss in
   chronic T2D [[215]66] is reinterpreted through dual pathological
   trajectories in our study: selective depletion of functional beta cell
   pools concurrent with compensatory expansion of
   immature/transcriptionally unstable/stress-resilient subpopulations.
   This imbalance likely reflects an islet compensation-decompensation
   continuum that ultimately exacerbates beta cell failure in T2D.
   Furthermore, we associated the accumulation of the ER stress-associated
   subpopulation with the activation of beta cell apoptosis pathways,
   proposing a “stress-exhaustion” model that mechanistically explains the
   recurrent identification of ER stress-associated beta cell subtypes
   across different studies [[216]21]. In summary, our findings highlight
   that diabetes is characterized not only by changes in the proportions
   of beta cell subpopulations but also by profound disruptions in gene
   expression across these clusters. As the disease progresses, beta cells
   in diabetic patients show more marked instability and heterogeneity at
   the transcriptional level, which may underlie the gradual loss of
   pancreatic function.

   Comparative analysis of pancreatic cell communication networks between
   ND and T2D highlight significant topological reorganization. We
   observed a shift from a stellate-/ductal- central signal integrators in
   ND to a ductal-centric interaction hub in T2D, where ductal cells serve
   as the primary integrators of signals from multiple cell types. We
   propose that the transcriptional reprogramming of ductal cells in T2D
   exacerbates beta cell dysfunction through metabolic-inflammatory
   crosstalk. Specifically, NFKBIA, a key mediator of the beta-ductal
   communication cascade, serves as a crucial immune and inflammatory
   regulator whose expression has been associated with reduced beta-cell
   exocytosis in T2D [[217]61]. Meanwhile, NF-κB activation (via
   NFKB1/NFKBIA) in ductal cells drives the secretion of pro-inflammatory
   mediators such as IL-6 and CXCL8 [[218]67], which are known to induce
   beta-cell apoptosis and impair glucose-stimulated insulin secretion
   [[219]68, [220]69]. Collectively, these findings position ductal cells
   as potential active orchestrators of beta-cell failure in T2D, linking
   transcriptional rewiring to metabolic stress and inflammatory niche
   formation. Furthermore, through systematic analysis of signal
   transduction axes (receptor-ligand-TF-pathway-target gene), we
   identified a 15-gene signature in T2D ductal cells, which exhibited
   high AUC values across independent cohorts, demonstrating the utility
   of communication-driven analyses in potential markers discovery. Among
   these features, RPL3, RPS4X, RPL13, and RPL36 are ribosomal
   protein-related and are all downstream targets of MYC. MYC, a popular
   target in cancer research, orchestrates 3D chromatin structure by
   forming CTCF-mediated loops in cancer [[221]70]. Though the role of MYC
   in diabetic chromatin remodeling remains unexplored, emerging evidence
   suggests its potential to reconfigure enhancer-promoter interactions of
   metabolic genes under hyperglycemic stress. For instance, MYC interacts
   with lncRNA CCAT1 [[222]71], a regulator of glucose metabolism
   [[223]72]. These observations position MYC as a plausible mediator of
   3D genomic changes in diabetic ductal cells, warranting validation
   through HiChIP or ChIA-PET assays. Additionally, KRT19 and KRT7 are
   established ductal cell markers, while S100A6 and S100A10, both members
   of the S100 family, contribute to intracellular calcium signaling.
   Aberrant expression of S100A6 has been linked to T2D and pancreatic
   cancer, and it has been proposed as a potential diagnostic marker for
   PDAC [[224]73, [225]74]. Although these pancreatic ductal cell
   profile-based diagnostic markers have robust performance, we
   acknowledge the limited clinical feasibility of pancreatic biopsy as a
   routine diagnostic approach for diabetes. The invasive nature and
   associated risks of biopsy procedures, render this method impractical
   for widespread clinical adoption. Nonetheless, we believe that this
   communication-based perspective provides novel mechanistic insights
   into disease progression and may help identify novel promising targets
   for therapeutic intervention.

Conclusion

   This study establishes that pathological remodeling of pancreatic
   cellular heterogeneity is central to T2D progression. Through machine
   learning-enhanced single-cell analytics, we developed PanSubPred and
   PSC-stat frameworks to resolve seven pancreatic lineages and stellate
   cell activation states, identifying 38 novel markers and a
   diabetes-/PDAC-associated aPSC activation cascade. T2D rewired
   intercellular communication toward ductal hubs while depleting mature
   beta cells and expanding dysfunctional subtypes (immature
   CD81+/ER-stressed DDIT3+). Analysis of the variations of pancreas
   cellular heterogeneity in disease progression could improve our
   understanding of the disease molecular mechanism and might help to
   develop more potential anti-diabetic therapeutic schedules.

Electronic supplementary material

   Below is the link to the electronic supplementary material.
   [226]Supplementary Material 1^ (63.2KB, xlsx)
   [227]Supplementary Material 2^ (5.1MB, docx)

Acknowledgements