Graphical abstract
graphic file with name fx1.jpg
[39]Open in a new tab
Highlights
* •
Cells from a subset of autoantibody-positive donors are classified
as T1D
* •
A shared gene signature in distinct T1D-associated models across
cell types
* •
Machine learning and single-cell profiling can be used to model T1D
__________________________________________________________________
Despite progress in therapeutic approaches that could delay T1D onset,
early detection of this autoimmune disease remains challenging. Patil
et al. evaluate the utility of machine learning for early prediction of
T1D and demonstrate the feasibility of modeling of T1D based on
single-cell profiling of islets.
Introduction
Muscle and adipose tissues respond to insulin to increase glucose
uptake. Insulin is a hormone that is made by specialized cells called
beta cells positioned in the islets of Langerhans in the pancreas. In
the autoimmune disease type 1 diabetes (T1D), which arises from a
complicated interplay between genetic and environmental factors,
T cells attack and destroy beta cells. During the early stages of the
autoimmune process, autoantibodies (AAbs) against pancreatic islets can
frequently be detected in the serum, and the presence of multiple AAbs
is a strong predictor of T1D progression.[40]^1 Although the discovery
of insulin was a milestone in T1D research that made the survival of
millions of patients possible, insulin therapy fails to provide
complete protection against diabetes-associated complications. Recent
research has revealed various immune cell types and secreted cytokines
responsible for beta cell destruction.[41]^2 These findings have led to
the development of therapies to slow down or prevent T1D onset. For
example, blocking T cells using teplizumab was recently approved by the
FDA and has been reported to delay progression to clinical T1D in
high-risk participants by 2 years.[42]^3 Moreover, multiple clinical
trials including tumor necrosis factor-α (TNF-α) inhibition using
golimumab[43]^4^,[44]^5 or regulatory T cell-based therapies[45]^6 are
actively pursued as opportunities to delay T1D onset in at-risk
individuals. Despite these breakthroughs and the thrilling prospects
presented by ongoing immunotherapy trials for T1D, the unmet clinical
need is to reliably identify individuals fated to develop T1D at the
earliest possible stages and substantially delay or prevent the disease
onset.
In this work, we evaluated the feasibility of modeling early molecular
events in tissues relevant to the etiology of T1D using machine
learning and single-cell transcriptomic maps of individual cells from
pancreatic tissues. The JDRF-supported nPOD[46]^7 and the
NIDDK-supported HPAP consortia[47]^8^,[48]^9 are ongoing efforts,
collecting pancreatic tissues and immune-related organs from hundreds
of controls, non-diabetic but islet AAb-positive (AAb+), and T1D organ
donors. Among numerous genomics and molecular assays, the
revolutionizing single-cell transcriptomics (single-cell RNA sequencing
[scRNA-seq]) has become a standard technology to study T1D development.
The first series of human donor islets analyzed by our team at HPAP
released transcriptional profiles of islets in 24 non-diabetic control
(CTL), AAb+, and T1D donors across more than ∼80,000 cells.[49]^10 In
Fasolino et al., we reported a surprising correlation between the
expression level of around 1,000 genes in beta cells, but not any other
cell types, with anti-glutamic acid decarboxylase (GAD) AAb levels
detected in the serum of AAb+ donors, suggesting that the progression
of the autoimmunity process is reflected in the transcriptome of AAb+
beta cells. Despite the new insights gained from scRNA-seq profiling in
this study and other reports,[50]^11 many questions related to the
early molecular events leading to autoimmunity in T1D remain
unanswered. For example, it is not clear if there are consensus
transcriptional changes associated with T1D in different islet cell
populations across the human population. In addition, it remains
unknown whether there are any consensus transcriptional changes
associated with T1D progression that can be detected at early stages of
autoimmunity in AAb+ donors.
Although statistical strategies for differential gene expression
analysis have been developed to address such questions, the agreement
of differentially expressed genes identified through various approaches
is very low,[51]^12 and choosing the best approach to select
differentially expressed genes is
challenging.[52]^10^,[53]^13^,[54]^14^,[55]^15^,[56]^16^,[57]^17^,[58]^
18 Here, we aimed to model T1D progression using machine learning
approaches employing scRNA measurements from 50 organ donors, acquired
through the HPAP program. We reasoned that machine learning strategies,
which can learn patterns from data, may identify consensus changes in
gene expression associated with T1D for cells in pancreatic tissues at
prediabetic stages. A machine learning model is trained to perform a
task by receiving several examples of input data, such as gene counts
as features, with corresponding output labels, e.g., individual cells
labeled as T1D, AAb+, or CTL. The model then updates internal
parameters to enhance prediction accuracy. We devised a machine
learning classifier based on the extreme gradient boosting[59]^19
(XGBoost) algorithm and carried out classifications of single cells
across the three donor groups. Remarkably, we report that T1D can be
modeled by the XGBoost algorithm with high accuracy using solely islet
cell transcriptomic data from T1D and CTL donors. Interestingly, our
classifier reported T1D-like islet cells in a subset of AAb+ donors,
demonstrating that the transcriptional adaptations that occur in islets
of patients with T1D are already initiated in some AAb+ donors.
Considering the inaccessibility of the pancreatic tissues, our model
using single cells from islets cannot be used directly to predict early
stages of T1D in living individuals. Nonetheless, our study reports the
utility of machine learning algorithms in the early detection of
molecular changes in T1D using single-cell transcriptomics.
Results
scRNA-seq data in human pancreatic islets
We took advantage of our most recent release of scRNA-seq experiments
in the HPAP program across 50 donors in three groups, namely T1D (n =
9), AAb+ (n = 10), and CTL (n = 31) ([60]Figure 1A). The preprocessing
of scRNA-seq data included filtering low-quality cells, doublet
removal, and dimensionality reduction, similar to our previously
described protocol.[61]^20^,[62]^21 The total number of cells obtained
after processing the scRNA-seq data across all conditions was ∼169,000
([63]Figure 1B). Considering variability in tissue isolation and
surgical procedures, different numbers of cells were incorporated from
each donor across corresponding conditions ([64]Figures S1A–S1C). We
annotated individual cells based on the expression of known marker
genes using scSorter[65]^22 and reported the frequency of 10 different
cell types across conditions ([66]Figure 1C). Overall, acinar, alpha,
and beta cells were the largest cell populations with 43,401, 47,988,
and 36,837 cells, respectively ([67]Figure 1C). The percentage of
acinar and alpha cells was evenly distributed across different donor
groups ([68]Figure 1D). Expectedly, beta cells were significantly less
abundant in T1D donors compared to other donor groups, reflecting the
autoimmune destruction of this cell type ([69]Figures 1D and [70]S1D).
Conversely, ductal cells were more abundant in tissues collected from
T1D donors than the other two groups, reflecting the difficulty of
isolating pure islets from these donors ([71]Figure 1D). The cell-type
annotation and composition across different groups were also consistent
with previously published studies.[72]^10^,[73]^20^,[74]^21 The
expression of marker genes across major cell types such as acinar
(PRSS1), alpha (GCG), beta (INS), delta (SST), ductal (KRT19),
endothelial (VWF), epsilon (GHRL), immune (NCF2),pancreatic polypeptide
(PP) (PPY), and stellates (COL1A1) further corroborated cell
annotations across all the samples combined ([75]Figures 1E and
[76]S2). Together, we compiled high-quality transcriptional data
generated by HPAP and annotated major cell types in islets of three
donor groups.
Figure 1.
[77]Figure 1
[78]Open in a new tab
scRNA-seq reveals the cell populations of the human pancreatic islets
in CTL, AAb+, and T1D donors
(A) The complete workflow depicting the scRNA-seq and machine learning
workflow using human pancreatic islet tissue samples.
(B) Pie chart showing the number of cells and donor distribution across
different biological conditions.
(C) Uniform manifold approximation and projection (UMAP) plot showing
the scSorter cell classification of islet cells.
(D) Stacked bar chart showing the percentage-wise distribution of cell
types across AAb+, control, and T1D donors.
(E) Multiple feature plots UMAPs depicting the validation of
cell-type-specific expression of marker genes. Acinar cells (PRSS1
high), alpha cells (GCG high), beta cells (INS high), delta cells (SST
high), ductal cells (KRT19 high), endothelial cells (VWF high), epsilon
cells (GHRL high), immune (NCF2 high), PP cells (PPY high), and
stellate cells (COL1A1 high).
Performance of the machine learning model on scRNA-seq islet data
We aimed to devise three binary classifiers using single cells between
any pair of donor groups: (1) single cells from T1D vs. single cells
from AAb+ donors, (2) single cells from T1D vs. single cells from CTL
donors, and (3) single cells from AAb+ vs. single cells from CTL
donors. We followed two distinct strategies to build a model across
donor groups. In the first strategy, which we refer to as “unannotated”
classification, all cells from each donor group were combined and used
for training and testing purposes. We reasoned that this approach could
take advantage of all cells in our scRNA-seq measurements, improving
the performance of our classification. The strategy’s disadvantage is
the uneven number of cell types across different donor groups, making
the frequency of a particular cell type enriched in a class as the
driver of the training process. In the second strategy, which we refer
to as “annotated” classification, cells of each donor group with the
same annotation, e.g., alpha cells in AAb+ donors, were combined and
used for training and testing purposes, leading to the development of
one classifier per annotated cell type. The advantage of this approach
is that only changes in gene expression of the same cell type would be
used to classify cells from different disease groups. However, in this
strategy, subsets of cells are utilized for training and testing steps
compared to the unannotated approach, potentially influencing the
performance of our classification.
For both annotated and unannotated strategies, we divided individual
cells into training and testing groups and subjected the training data
to hyperparameter optimization (HPO) using the XGBoost algorithm. After
performing HPO using a 5-fold cross-validation procedure, we obtained
the optimal parameter set, which was used for training and testing the
final model. We performed 100 repetitions of the above procedure by
randomly shuffling the training data (i.e., random sampling without
replacement). In the unannotated XGBoost classifier built for all
cells, the T1D vs. AAb+ and T1D vs. CTL binary classifiers performed
exceptionally well, averaging ∼99% accuracy, ∼99% sensitivity, and ∼97%
specificity. The AAb+ vs. CTL classifier demonstrated an accuracy of
∼96% and a specificity of ∼88%. The relatively small decrease in
performance in the AAb+ vs. CTL comparison likely reflects the
similarity in transcriptional landscapes of single cells from AAb+ and
CTL donors ([79]Figure 2B; [80]Table S1). We also compared the
performance of XGBoost with other machine learning models such as a
support vector machine (SVM) with linear kernel, a SVM with radial
kernel, and naive Bayes methods across all these pairwise comparisons
and observed that XGBoost outperformed the other models
([81]Figure S3A; [82]Table S2).
Figure 2.
[83]Figure 2
[84]Open in a new tab
Classification performance of machine learning model on scRNA-seq islet
data
(A) A schematic workflow of XGBoost and performance. The
machine-learning-based XGBoost model was built for gene selection and
classification. The dotted lines show the training and testing
procedures, where T denotes the gradient boosting tree models. The
double lines show 100 repetitions of the entire workflow.
(B) Boxplots depicting a pairwise comparison of the XGBoost method
across all cells (unannotated) in the dataset.
(C) Performance of XGBoost across major cell types for T1D vs. CTL
comparison using boxplots.
(D) Performance of XGBoost across major cell types for T1D vs. AAb+
comparison using boxplots.
(E) Performance of XGBoost across major cell types for AAb+ vs. CTL
comparison using boxplots.
In the annotated classification between T1D and CTL groups built for
each annotated cell type, the XGBoost method performed exceptionally
well on our three metrics in the acinar, alpha, beta, and ductal cells.
However, binary classification using delta cells or immune cells, which
contained fewer cells compared to other cell types, demonstrated a
reduced performance ([85]Figure 2C). Similar results were observed in
the T1D vs. AAb+ and AAb+ vs. CTL comparisons ([86]Figures 2D and 2E).
Additionally, the average standard deviation in the comparisons of T1D
vs. CTL, T1D vs. AAb+, and AAb+ vs. CTL across 100 repetitions were
found to be very low (<1%), demonstrating the robustness and stability
of XGBoost models ([87]Table S1). Of note, the comparison of XGBoost
with SVMs or naive Bayes across annotated cells further demonstrated
the superiority of XGBoost models ([88]Figure S3B; [89]Table S2).
Taking these results altogether, the annotated XGBoost classifier
exhibited high performance across all cell-type comparisons.
Top-ranked genes selected from the machine learning model and pathway
enrichment analysis
A major reason for our choice of XGBoost over other machine learning
algorithms such as convolutional neural networks is the
interpretability and transparency in the XGBoost’s decision-making
process. In particular, XGBoost produces feature importance rankings,
allowing us to understand which features, i.e., genes, drove
predictions. In contrast, neural networks are often considered “black
boxes,” making it challenging to interpret their predictions.[90]^23 To
better understand which features drove the high-performance predictions
across single cells, we obtained the key gene signatures for each
comparison and used two strategies: (1) we ranked the lists of genes
based on the robust ranking algorithm (RRA) approach[91]^24
([92]Tables S3, [93]S4, and [94]S5) and (2) we examined the unranked
list of genes based on their selection frequency across 100 repetitions
([95]Tables S6, [96]S7, and [97]S8). These top-selected genes were used
for downstream pathway or protein-protein interaction (PPI) analysis to
further compare gene signatures associated with three clinical
conditions. The ranked list of genes with p < 0.05 based on the RRA
approach obtained from the unannotated T1D vs. CTL classifier was
enriched with genes annotated as “lipid mRNA metabolic process,”
“defense to external biotic,” and “antimicrobial” pathways
([98]Figure 3A). The 20 KEGG pathways (false discovery rate
[FDR] < 0.05) enriched within the top features (ranked genes with
p < 0.05) of unannotated and annotated T1D vs. CTL classifiers were
related to HIV and human papillomavirus infections in addition to
cytokine signaling ([99]Figures 3B, [100]S4, and [101]S5).
Figure 3.
[102]Figure 3
[103]Open in a new tab
Top-ranked genes selected from the machine learning model and pathway
enrichment analysis
(A) GO cluster analysis showing pathways based on genes obtained from
T1D vs. CTL comparison across unannotated all cells.
(B) Top 20 KEGG pathways based on ranked genes obtained from T1D vs.
CTL comparison across all cells (unannotated) and different cell types
(annotated) ([104]Table S3).
(C) UpSetR chart showing the common and unique count of genes across
annotated cells in T1D vs. CTL comparison.
(D) Pathways based on shared RRA_combined gene list in T1D vs. CTL
annotated cells.
(E) UpSetR chart showing the common and unique count of genes across
annotated cells in T1D vs. AAb+ comparison.
(F) UpSetR chart showing the common and unique count of genes across
annotated cells in AAb+ vs. CTL comparison.
We next aimed to assess whether there are consensus transcriptional
changes associated with T1D in different islet cell populations across
donors and evaluated commonality among top features (i.e., genes)
across different cell types. We applied the RRA approach again but now
on the ranked list of genes across cell types. This strategy can
produce reliable and consistent rankings even in the presence of noisy
or incomplete data. In the T1D vs. CTL classification, while more than
100 genes were uniquely detected in classifiers built on each annotated
cell type such as acinar, alpha, ductal, and beta cells, 66 genes were
common across classifiers of all annotated cell types based on their
high RRA scores, suggesting that changes in the transcriptional outputs
of these genes occur across T1D islets independent of their cellular
ontogeny ([105]Figure 3C). In particular, “neutrophil degranulation,”
“ER-phagosome pathway,” and “pancreatic secretion” were enriched within
this set of 66 common genes associated with models based on annotated
cell types ([106]Figure 3D). Moreover, in AAb+ vs. CTL and T1D vs. AAb+
classifiers, both common and unique genes across different annotated
cell types were detected, suggesting the relevance of multiple pathways
to changes in cells from AAb+ donors ([107]Figures 3E and 3F). Hence,
despite a clear manifestation of autoimmunity associated with beta
cells, our modeling strategy reports shared changes in transcriptional
landscapes of distinct cell types of islets. Together, the XGBoost
classifiers based on training the model using single cells grouped as
different cell types revealed the link between multiple genes and
pathways associated with T1D and AAb positivity.
One reason for choosing the XGBoost classifier in our modeling strategy
was to extract top features associated with T1D across different cell
types. Although statistical approaches performing differential
expression (DE) analysis also aim to determine genes with different
expression between disease groups, there are numerous concerns related
to these approaches, which are described in the [108]discussion.
Nonetheless, we also followed two DE analysis approaches as
complementary strategies: (1) DE gene lists obtained from individual
cells in pairwise comparisons of different donor groups using Wilcoxon
rank-sum tests ([109]Tables S9, [110]S10, and [111]S11) and (2) DE gene
lists obtained by performing individual donor-wise pseudobulk analysis
using DESeq2 ([112]Tables S12, [113]S13, and [114]S14). There was no
consensus on these two strategies. Single-cell-based DE analysis
reports too many genes to be differentially expressed (∼11,000) and
pseudobulk-based DE analysis reports too few genes (∼10) to be
differentially expressed. Hence, the DE-based approaches fail to
reliably detect disease-associated genes.
Expression of HLA class I genes in beta cells across healthy and T1D donors
We next focused on beta cell classifiers within all three groups and
focused on genes with selection frequencies higher than 50% across 100
iterations following the unranked gene selection approach. In
particular, we examined the enrichment of highly frequent genes as top
features within KEGG pathways ([115]Figures 4A and 4B). Among the top
10 significant pathways (FDR < 0.05), “type 1 diabetes mellitus” and
“antigen processing” were found among the highly frequent features of
T1D vs. CTL and T1D vs. AAb+ classifiers for beta cells
([116]Tables S15 and [117]S16). The specific genes involved in T1D and
antigen processing and presentation pathways were predominantly from
HLA class I, i.e., HLA-A, HLA-B, HLA-C, and HLA-E. Additionally, these
genes were detected across all 100 repetitions of modeling, suggesting
that HLA class I genes were reproducible top features across donors.
Figure 4.
[118]Figure 4
[119]Open in a new tab
Expression of HLA-1 genes in beta cells across healthy and T1D donors
(A) Comparison of significant KEGG pathways (FDR < 0.05) obtained from
different pairwise classifiers.
(B) Significant KEGG pathways for T1D vs. CTL (FDR < 0.05).
(C–E) The top 3 modules were obtained from the PPI network using the
MCODE algorithm.
(F) Average expression of beta cells in non-diabetic controls and T1D
donors.
Next, we created a PPI network map with significant genes as nodes and
their connections as edges. We further applied the MCODE
algorithm[120]^25 on the PPI network map and obtained three clusters or
key modules ([121]Figure 4C). All the HLA class I proteins were grouped
into one cluster that was significant (p < 0.05). In the beta cell
classification, genes encoding the HLA class I proteins were
significantly upregulated in T1D donors compared to CTL donors
([122]Figure 4F). Additional significant clusters contained genes
important for mitochondrial function and transcriptional elongation.
Next, we analyzed the expression of the HLA class I genes in CTL and
T1D beta cells individually ([123]Figure 4F). The dots inside the
violin plots represent individual donors, where the cell-level gene
counts were aggregated into pseudobulk counts. Strikingly, the HLA
class I genes were upregulated within the few remaining beta cells from
T1D donors ([124]Figure 4F). Modeling differences in beta cells of T1D
and CTL donors suggest that small residues of beta cells in T1D donors
express high levels of HLA class I genes. These results are in
agreement with prior findings of increased HLA class I expression in
islets from patients with T1D obtained using antibody staining.[125]^26
We followed a comprehensive and robust resampling approach with over
100 iterations and additional cross-validations. To ensure an unbiased
modeling strategy, the test datasets from the 100 iterations in the
outer loop were independent of the training set. Nevertheless, we
generated scRNA-seq data in new sets of T1D and CTL organ donors and
assessed the significance of predicted genes in this confirmatory donor
cohort that were not utilized for testing or training ([126]Table S17).
We first performed gene set enrichment analysis and examined the
enrichment of the top features predicted by our model in up- or
down-regulated genes in two T1D donors compared to two CTL donors in
this confirmatory cohort. Remarkably, we found that the top predicted
genes based on our model were significantly enriched across major cell
types such as alpha, immune, and ductal cells in T1D donors
([127]Figure S6A). Moreover, HLA class I genes in addition to other top
features predicted by our model demonstrated increased expression
levels in T1D, but not CTL, organ donors in this confirmatory cohort
([128]Figure S6B).
Prediction of AAb+ cells using classification models from annotated cells in
T1D and CTL donors
One key goal in this study was to evaluate whether any subset of islet
cells from AAb+ organ donors demonstrate transcriptional similarity to
those from T1D individuals and, if such a similarity can be modeled,
the expression of which genes can classify a single AAb+ cell as a T1D
cell. Hence, for each annotated cell type, we used the trained T1D vs.
CTL classifier, where cells from T1D donors were labeled as class 1 and
cells from CTL donors were labeled as class 0. We tested how this model
predicted the class of AAb+ cells. Using the probability scores
obtained for each individual cell from the AAb+ donor group, we
determined the predicted class of a cell where a probability of >0.5 is
classified as T1D and less than 0.5 as CTL. Although 90% of cells from
AAb+ donors were predicted to be non-diabetic (class 0), around 10% of
cells were classified as T1D across different cell types (class 1)
([129]Figure 5A). Importantly, these “T1D-like” cells were not present
at uniform abundance among all organ donors but were highly enriched
among specific donors, in particular donors labeled as HPAP092 and
HPAP107 ([130]Figure 5A). Strikingly, the beta cells classified as T1D
from AAb+ donors had transcriptomic signatures of HLA class I genes
extremely similar to the T1D group ([131]Figure 5B). We also compared
the expression of alpha cells from AAb+ donors classified as T1D and
observed similar results, where HLA class I genes were upregulated in
those AAb+ cells that were predicted as T1D ([132]Figure 5C). The
pancreatic alpha cells are known to have a key role in the development
of diabetes mellitus.[133]^27^,[134]^28 It has been reported that the
alpha cells from donors with recent-onset T1D demonstrate reduced
glucagon secretion and dysregulated gene expression.[135]^29 Another
study using immunofluorescence staining showed that the majority of HLA
class I genes are expressed on pancreatic alpha cells and are
particularly hyperexpressed in the T1D group.[136]^30 We further
combined all the cells from the two AAb+ donors that were majorly
predicted as T1D and compared their HLA class I genes between all
groups. We confirmed that the HLA-A, HLA-B, HLA-C, and HLA-E genes were
highly upregulated in cells from AAb+ donors predicted as T1D, which
might reflect that autoimmunity had already progressed in these AAb+
individuals ([137]Figure 5D). Projection of cells from these donors
highlights their distribution across the uniform manifold approximation
(UMAP) ([138]Figures S7A and B). In addition, we created a module gene
score for HLA class I genes and observed a similar enrichment of this
pathway in AAb+ and T1D donor groups compared to CTL donors
([139]Figure S7C). Of note, HPAP107 is an organ donor expressing both
GAD and IA-2 AAbs, indicative of further disease progression, while
HPAP092 is a single GAD AAb+ donor. Together, our modeling strategy
discovered that a subset of cells in specific AAb+ organ donors have
transcriptional patterns resembling those typically found in islets of
T1D donors.
Figure 5.
[140]Figure 5
[141]Open in a new tab
Prediction of AAb+ cells using trained T1D-CTL classifier across major
cell types
(A) Distribution of cells misclassified as T1D in different cell types.
(B) Comparing the average expression of HLA-I genes among beta cells
from AAb+ donors classified as T1D with other conditions.
(C) Comparing the average expression of HLA-I genes among alpha cells
from AAb+ donors classified as T1D with other conditions.
(D) Comparing the average expression of HLA-I genes among all cell
types from two AAb+ donors classified as T1D with other conditions.
Prediction of AAb+ cells using classification models from unannotated cells
in T1D and CTL donors
We next compared transcriptomic differences and similarities between
T1D, AAb+, and CTL groups using our pretrained unannotated T1D-CTL
classifier. Among ten AAb+ donors, different percentages of cells from
six AAb+ donors were classified as T1D, with the majority belonging to
the HPAP092 donor ([142]Figure 6A). We next sought to evaluate
transcriptional profiles of the AAb+ cells that were predicted as T1D.
To provide a reference for these T1D-like cells, we first focused on
significant genes obtained in unannotated T1D vs. CTL, T1D vs. AAb+,
and AAb+ vs. CTL classifiers ([143]Tables S6, [144]S7, and [145]S8). We
checked the gene selection frequency of HLA class I, HLA class II, and
non-HLA genes across different unannotated classifiers
([146]Figure 6B). The selection frequency of HLA class I genes was
higher compared to some of the HLA class II genes. In addition, we also
checked the selection frequency of non-HLA genes including INS, IL32,
TNFAIP3, and LMO7 that have been associated with T1D
pathology.[147]^31^,[148]^32 The expression of HLA class I and II genes
among all the AAb+ cells from HPAP092 classified as T1D had a similar
expression pattern to the T1D group ([149]Figures 6D and 6E). Previous
studies have shown the inherited risk for T1D to be largely determined
by specific HLA-DQA1, HLA-DQB1, HLA-DRA, HLA-DPA1, and HLA-DRB1
alleles.[150]^31^,[151]^33^,[152]^34 The expression of HLA-DPB1 was
found to be higher in endocrine cell types of single-cell islet data
from diabetic individuals.[153]^10 We also observed a down-regulation
of the INS gene in AAb+ cells (HPAP092), which was similar to the T1D
group ([154]Figure 6F). Moreover, the IL32, TNFAIP3, and LMO7 genes
were found to be upregulated in AAb+ cells predicted to be in the T1D
class (HPAP092). Previously, an association to T1D had been reported
for these non-HLA genes (INS,[155]^31^,[156]^32
TNFAIP3,[157]^32^,[158]^35 LMO7,[159]^32^,[160]^36 and IL32[161]^37).
The similarities between the transcript levels of the HLA class I
([162]Figure 6D), HLA class II ([163]Figure 6E), and non-HLA genes
([164]Figure 6F) observed between AAb+ cells predicted as T1D and the
actual T1D group suggest that in specific AAb+ organ donors, the
pathogenic process toward T1D had progressed further than that in
typical AAb+ individuals. Altogether, modeling of the transcriptomic
differences between islets cells from T1D and CTL donors using either
an annotated or an unannotated classifier revealed individual cells
from AAb+ donors with transcriptional similarity to those from T1D
donors.
Figure 6.
[165]Figure 6
[166]Open in a new tab
Prediction of AAb+ cells using trained T1D-CTL classifier across all
cells together
(A) Distribution of AAb+ cells predicted as T1D using trained T1D-CTL
classifier for all cells.
(B) Selection frequency of genes from HLA-I and HLA-II class.
(C) Selection frequency of genes from non-HLA class relevant to T1D.
(D) Comparing the average expression of HLA-I genes among all cells
from HPAP092 AAb+ donor classified as T1D.
(E) Comparing the average expression of HLA-II genes among all cells
from HPAP092 AAb+ donor classified as T1D.
(F) Comparing the average expression of non-HLA genes among all cells
from HPAP092 AAb+ donor classified as T1D.
Donor-wise classification using the LOOCV strategy
To evaluate the performance accuracy per donor, we applied a unique
splitting criterion for testing and training purposes and implemented
the leave-one-out cross-validation (LOOCV) strategy. We trained the
model based on cells from all donors except one and tested the model’s
performance on the one donor left out of the training step. This
process was repeated across all donors. Remarkably, in this analysis,
the AAb+ donor HPAP092 was classified as T1D across all the annotated
cell-type comparisons ([167]Figure S8). For evaluation purposes, we
considered LogNormalize data (RNA assay) in addition to the default
SCTransform (SCT assay) data. Similar results were observed on the RNA
assay where the AAb+ donor HPAP092 was classified as T1D across all the
annotated cell types ([168]Figure S9). These results from the LOOCV
strategy confirmed of our previous observations ([169]Figure 5A). The
mean classification accuracy values for each annotated cell type using
the LOOCV strategy were between 80% and 90% for most cell types except
∼72% for the beta cell classifier in both the SCT assay
([170]Figure S10) and RNA assay ([171]Figure S11). This is likely
caused by the very few beta cells remaining in several of the donors;
for instance, the islets recovered from donors HPAP021 and HPAP022 had
only two beta cells each among the total 4,410 and 864 cells analyzed,
respectively. This low beta cell count in a subset of T1D donors led to
lower overall accuracy by the annotated beta cell classifier of T1D vs.
AAb+ using LOOCV. In contrast, when we had employed the annotated beta
cell classifier for T1D vs. AAb+ without the LOOCV strategy, all the
beta cells were pooled together, leading to high prediction accuracy.
Expression of CXCL8 gene in cell types across healthy, AAb+, and T1D donors
Another key finding based on top predicted features relates to the
expression of CXCL8, which is commonly known to be involved in the
immune system’s response to inflammation. We observed that among
cytokines and chemokines, CXCL8 was the only cytokine other than IL32
that was selected across all training instances for the T1D-CTL
comparison ([172]Figure S12A). Next, we checked the selection of CXCL8
across classifiers for annotated cell types and found that among
T1D-AAb+ classifiers, CXCL8 was selected majorly in ductal and immune
cell types ([173]Figure S12B). We also compared the expression of CXCL8
across individual donors to understand the expression at the donor
level. Strikingly, CXCL8 was highly expressed only in one AAb+ donor
(HPAP092), who was classified as a T1D donor based on our machine
learning approach ([174]Figures S12C and S12F). Evaluating the
expression abundance at the cell-type level across conditions
demonstrated that ductal, immune, and stellate cells had similar
expression patterns for CXCL8 in AAb+ and T1D donors, with the
percentage of cells expressing CXCL8 being the highest in T1D (∼55%)
and AAb+ donors (∼42%) compared to CTL donors (∼22%)
([175]Figures S12D–S12E). Lastly, we measured the expression of CXCL8
in our independent cohort and observed higher expression in ductal
cells of the T1D group ([176]Figure S6C). Together, the results of our
machine learning strategy pinpointed changes in gene expression in T1D
and AAb cohorts that were reproducibly detected across donors at the
single-cell level.
Discussion
Although substantial progress has been made over the past decades in
our understanding of the alterations in the pancreas of patients with
T1D, the underlying molecular processes of disease progression from
healthy to AAb positivity to T1D remain to be elucidated fully. Using
scRNA-seq islet data from 50+ human organ donors, we performed a
comprehensive analysis by applying a machine learning approach on the
large islet gene expression data from T1D, AAb+, and CTL individuals to
understand the disease progression at the single-cell level.
There are several tools useful for a broad range of concepts in the
single-cell field; however, some of the key measurements, such as DE
analysis, remain challenging. Single-cell data are often sparse,
heterogeneous, and multidimensional in nature, and it becomes
challenging to perform differential state analysis. There is a high
level of noise and dropouts (zero values),[177]^38 and the data also
encompass a large amount of biological variability.[178]^39 At present,
there are three approaches to performing DE analysis in scRNA-seq data:
(1) the individual cell approach where cell-level DE measurements are
performed using a negative binomial generalized linear model or
Wilcoxon rank-sum tests, (2) the pseudobulk approach where the
cell-level counts are aggregated into pseudobulk counts for DE analysis
using bulk RNA-seq tools such as edgeR and DESeq2, and (3) mixed
modeling with random effects where sample-level inferences are
considered. Overall, the individual cell approach developed for
scRNA-seq was outperformed by pseudobulk and mixed-modeling
approaches[179]^40 through a well-controlled FDR. However, these two
approaches reveal a lack of consensus for DE analysis.[180]^10 While
the computational time required for analyzing data through the
mixed-modeling approach is understood to be extremely high compared to
pseudobulk even in a down-sampled dataset,[181]^17^,[182]^18^,[183]^40
Zimmerman et al.[184]^41 described the pseudobulk approach as
conservative, where many DE genes were not detected. Additionally, the
agreement of DE genes identified through various approaches was very
low.[185]^12 Hence, choosing the best approach to select DE genes
remains challenging.[186]^10
In this study, we focused on the XGBoost method to achieve insights
into the prediabetic and diabetic disease states of pancreatic islets.
Sparse read counts are a main characteristic of scRNA-seq data, which
is important to consider when performing differential gene expression
analysis. The superior performance and robustness of the XGBoost method
on high-throughput gene expression
studies[187]^42^,[188]^43^,[189]^44^,[190]^45 led to its increased
popularity in single-cell
biology,[191]^46^,[192]^47^,[193]^48^,[194]^49^,[195]^50 and this
approach remains more powerful compared to other machine learning
approaches, including neural networks.[196]^50 To the best of our
knowledge, there is no literature using the machine learning approach
to identify gene signatures and classification of cells into relevant
disease states on single-cell data from T1D donors. We demonstrate
excellent performance of XGBoost classifiers built for T1D vs. CTL, T1D
vs. AAb+, and AAb+ vs. CTL comparisons across all cells and individual
cell types compared to other classifiers such as linear or radial SVM
and naive Bayes methods. The average accuracy, sensitivity, and
specificity obtained from 100 repetitions were found to be higher in
the T1D vs. CTL comparison compared to the AAb+ vs. CTL one, likely
because of the high similarity of the latter two states. In addition,
the top-selected gene lists obtained for each of these comparisons were
found to be upregulated in several key pathways, such as the T1D and
antigen processing and presentation pathways, with special enrichment
seen in beta cells. The genes involved in these pathways belonged to
HLA class I; therefore, we measured their individual expression in beta
cells and found that they were more highly expressed in T1D donors than
in CTL donors. We also evaluated the expression of these genes in an
independent cohort and found that the expression levels were indeed
consistently upregulated in T1D donors compared to CTL donors.
Increased expression of HLA class I genes in T1D has been reported in
the past. Benkahla et al.[197]^30 reported hyperexpression of HLA class
I genes in T1D donors through immunofluorescence staining and
microscopic image analysis. Hamilton-Williams et al.[198]^51 used the
non-obese diabetic mice model and showed that hyperglycemia was
observed in those mice that exhibited higher major histocompatibility
complex class I expression in beta cells. Richardson et al.[199]^52
performed enteroviral capsid protein vp1 staining on a large cohort of
neonatal, pediatric control, and T1D groups and observed
hyperexpression only in T1D donors. Nejentsev et al.[200]^53 reported
the contribution of HLA-A, HLA-B, and HLA-C toward T1D. In contrast to
these studies, Skog et al.[201]^54 performed staining through
immunohistochemical staining to measure protein expression and RNA-seq
to measure mRNA expression in non-diabetic controls and patients with
T1D; however, they reported no changes in the HLA class I genes across
these groups. Using imaging mass cytometry, Wang et al. found HLA-A,
-B, and -C expression to be upregulated in islets from short-, but not
long-, duration T1D and also overexpressed in beta cells in very recent
onset disease.[202]^55 Detecting this gene signature in T1D-like AAb+
cells in our study further demonstrates the upregulation of this
pathway at the early stages of autoimmunity.
A surprising discovery from this study is the observation that a subset
of non-diabetic donors positive for islet AAbs contain significant
numbers of pancreatic cells with gene expression profiles that do not
resemble AAb−, non-diabetic controls as expected but rather those
present in the pancreas of T1D individuals. It is tempting to speculate
that those AAb+ individuals with a large proportion of T1D-like
pancreatic cells would have been the ones to progress to the diabetic
state more rapidly than those in whom these cells are not present.
Another surprising finding from our study was the consistent
upregulation of CXCL8 in these AAb+ and T1D donors, especially in
ductal cells as predicted by our machine learning models. Only few
studies discussed the role of CXCL8 in T1D,[203]^56^,[204]^57 with the
primary factors of T1D disease being immune dysregulation and
inflammation, where several cytokines and chemokines contribute to the
inflammatory process; we speculate that CXCL8 would be a potential
biomarker in the AAb+ and T1D conditions. Unfortunately, it is
impossible to test this hypothesis due to the cross-sectional nature of
our study and the fact that the human pancreas cannot be biopsied
safely. Nevertheless, this finding provides strong evidence that the
transcriptomic changes that occur during the pathogenesis of T1D are
not simply a consequence of the hyperglycemic state; rather, they
appear to be an integral part of disease progression. Our future
studies will focus on examining the utility of machine learning
approaches in peripheral blood collected from T1D individuals.
Limitations of the study
In this study, we used scRNA-seq data and applied machine learning
approaches to train our models. The machine learning models required
large computational resources to train because of the high-dimensional
nature of single-cell transcriptomic data. Another limitation is that
the biomarkers identified could not be tested due to the
cross-sectional nature of our study since human pancreas cannot be
biopsied safely.
STAR★Methods
Key resources table
REAGENT or RESOURCE SOURCE IDENTIFIER
Critical commercial assays
__________________________________________________________________
Chromium Single Cell 30 Reagent Kits 10X Genomics N/A
__________________________________________________________________
Deposited data
__________________________________________________________________
scRNA-seq data Patil et al.[205]^20 [206]https://hpap.pmacs.upenn.edu/
__________________________________________________________________
Software and algorithms
__________________________________________________________________
Cell Ranger v3.0.1 10X Genomics
[207]https://www.10xgenomics.com/support/software/cell-ranger/latest
R R Development Core Team, 2008 [208]https://www.r-project.org/
R Studio N/A [209]https://www.rstudio.com/
xgboost v1.5.0.2 Chen et al.[210]^19
[211]https://cran.r-project.org/web/packages/xgboost/index.html
scSorter v0.0.2 Guo et al.[212]^22
[213]https://cran.r-project.org/web/packages/scSorter/vignettes/scSorte
r.html
Seurat v4.1.0 Hao et al.[214]^58 [215]https://satijalab.org/seurat/
scDblFinder v1.8.0 Germain et al.[216]^59
[217]https://bioconductor.org/packages/release/bioc/html/scDblFinder.ht
ml
SingleCellExperiment v1.16.0 Amezquita et al.[218]^60
[219]https://bioconductor.org/packages/release/bioc/html/SingleCellExpe
riment.html
sctransform v0.3.3 Hafemeister et al.[220]^61
[221]https://satijalab.org/seurat/articles/sctransform_vignette.html
This study Patil et al.[222]^20
[223]https://github.com/AbhijeetRPatil/ML_Islets)
[224]Open in a new tab
Resource availability
Lead contact
Correspondence and requests for materials should be addressed to Lead
Contact, Golnaz Vahedi (vahedi@pennmedicine.upenn.edu).
Materials availability
This study did not generate new unique reagents.
Data and code availability
The pancreatic islet sequencing data and processed scRNA-seq Seurat
object can be found in PANCDB
([225]https://hpap.pmacs.upenn.edu/analysis). The scripts used for
scRNA-seq data processing and machine learning modeling are available
on GitHub ([226]https://github.com/AbhijeetRPatil/ML_Islets). Any
additional information required to reanalyze the data reported in this
work paper is available from the [227]lead contact upon request.
Experimental model and subject details
Our study does not include animals, plants, microbe strains, cell
lines, primary cell cultures. Pancreatic islets were procured by the
HPAP consortium (RRID:[228]SCR_016202;
[229]https://hpap.pmacs.upenn.edu), part of the Human Islet Research
Network ([230]https://hirnetwork.org/), with approval from the
University of Florida Institutional ReviewBoard (IRB # 201600029) and
the United Network for Organ Sharing (UNOS).
Method details
We present an overview of the ML-based XGBoost approach for the
classification of pancreatic scRNA-seq islet data from different
conditions ([231]Figure 1A). The complete process included the
procurement of human islet tissues, the preparation of a single-cell
suspension and 10x Genomics sample processing.
We used R programming language[232]^62 to perform all the computations,
including data pre-processing, ML model building, and downstream
calculations and visualizations. The caret,[233]^63 dplyr,[234]^64 and
matrix[235]^65 packages were used for data wrangling tasks and
Seurat[236]^58 for working with the scRNA-seq object. The plots were
generated using ggplot2,[237]^66 ggpubr,[238]^67 and cowplot.[239]^68
The ML model was built using XGBoost[240]^19 for classification
purposes, and the gene selection was performed using
Ckmeans.1d.dp.[241]^69^,[242]^70 All ML computations, including
hyperparameter optimization tasks, were performed through parallel
computing using #cores between 30 to 100.[243]^62
scRNA-seq data description and analysis
The single-cell sequencing (scRNA-seq) experiments were performed using
the Single-cell 3′ Reagent v2 and v3 kits from 10X Genomics. The
detailed clinical information of donors showing their medical history,
BMI, age, auto-islet antibody test, HbA1c, and C-peptide levels is
provided ([244]Table S17). The libraries were processed using 10X
Genomics Cell Ranger v6.1 software which aligns the reads and generates
feature-barcode matrices (3' gene expression data) using a collection
of several pipelines.[245]^71 The experimental details and
pre-processing steps for all the samples from pancreatic-islet data
were followed as described previously.[246]^21^,[247]^72 We first
obtained HPAP samples from different biological conditions such as
auto-antibody positive (AAb+; n = 10; cells = 36,244), type 1 diabetes
(T1D; n = 9; cells = 34,524), and healthy controls (CTL; n = 31;
cells = 123,102). The AAb+ samples were determined based on the AAb+
screening test to measure the levels of antibodies against glutamic
acid decarboxylase (GAD) to determine positivity.[248]^21^,[249]^72 The
combined feature-barcode matrix, including cells from all HPAP samples,
added to a total of 193,870 cells. We pre-processed the raw data
following the protocol described previously,[250]^21^,[251]^72
excluding type-2 diabetes (T2D) samples. We also performed an
additional quality control step by removing the mitochondrial and
ribosomal reads from the raw data. We then used the exact pipeline
described in[252]^21^,[253]^72 for downstream analysis using Seurat
v4.1.0[254]^58 for creating the single-cell object, scDblFinder
v1.8.0[255]^59 for removing doublets from the data,
SingleCellExperiment v1.16.0[256]^60 for data wrangling, sctransform
v0.3.3[257]^61 for data transformation through normalization and
scaling, and finally scSorter v0.0.2[258]^22 for cell-type annotations.
The final processed and annotated scRNA-seq data object contained a
total of 169,027 cells and 30,002 genes for which transcripts could be
detected in at least on cell.
Machine learning classification network architecture and training protocol
The annotated scRNA-seq data includes 50 HPAP donor samples with
169,027 cells and 30,002 genes from three groups of CTL (n = 31), AAb+
(n = 10), and T1D (n = 9). The XGBoost machine learning framework is
constituted of inner and outer loops. In the outer loop, we first
randomly split the pre-processed SCT normalized single-cell data into
training and testing sets with 70% and 30% of HPAP samples,
respectively. In the inner loop, the training was subjected to
hyperparameter tuning, and a 5-fold cross-validation was performed
across 200 sub-training model iterations to select the best model with
a minimum error rate. We repeated this hyperparameter optimization
procedure ten times to select the best weights and parameters. The
final best weights and parameters obtained were applied to the complete
70% of the training data matrix to obtain the final best training
model. Lastly, we applied 30% of the test data matrix on the
best-trained model to evaluate predictions. We used the evaluation
metrics of accuracy, sensitivity, and specificity on the test data to
measure the performance of the model. Along with these metrics, we also
obtained the ranked list of genes, also called the best features, from
the best-trained model.
To increase the robustness of our approach and attain reliable results,
we repeated this entire process which consists of randomly splitting
the single-cell data into training (70%) and testing (30%) by using
random sampling without replacement in the outer loop for 100
iterations. The mean and SD classification accuracy, sensitivity, and
specificity across 100 XGBoost iterations were calculated for the final
average performance evaluation. For the final significant gene
selection, we built a matrix of genes X iterations. The ranked list of
genes obtained in each outer loop iteration was added to the matrix. We
counted the gene occurrences across all 100 iterations and ranked them
accordingly. The final significant gene list consisted of the list of
genes with the count showing how many times they were selected in the
best-trained model in 100 iterations. The complete workflow of the
XGBoost scRNA-seq ML framework is shown in [259]Figure 2A. Lastly, we
also evaluated several other machine learning models such as support
vector machines (SVMs) with linear and radial kernels,[260]^73 and
naive bayes[261]^74 methods for comparing the performance with XGBoost
method.
XGBoost method
eXtreme Gradient Boosting (XGBoost)[262]^19 is an extension of
gradient-boosted decision trees (GBDT) and is specifically optimized to
provide faster computations through parallel and distributed computing.
This ensemble learning method uses a boosting approach to improve
prediction accuracy by building many aggregated trees to form a single
consensus prediction model.[263]^75 The least squares loss function is
used to reduce the loss.[264]^76 The XGBoost method creates trees, and
the residuals obtained from previous trees are given as input to the
subsequent tree, which improves the overall prediction by modeling the
errors. For each tree sequence, a subsample of training data is
randomly drawn without replacement from the entire training data and is
used to fit the tree and compute the model update. Additional trees are
not allowed to be added after reaching the pre-specified maximum
threshold or if the models converges and the performance does not
improve which helps to avoid over parametrization. Overall, this highly
effective scalable tree boosting system was originally proposed for
sparse data and weighted quantile sketch for approximate tree
learning.[265]^19 XGBoost method is applied in a wide range of
applications such as regression, classification, ranking, and
user-defined prediction problems.
Feature importance score
The trained XGBoost model automatically provides a feature importance
score.[266]^77 The score indicates how important the feature (gene in
this study) was used for the model’s prediction. The feature importance
score was obtained for each trained model. We selected the genes with
non-zero importance scores for all models. We obtained the ranked lists
of genes across 100 repetitions of train-test splits.
Hyperparameter optimization (HPO)
The XGBoost method uses several parameters to control the bias-variance
tradeoffs. The tree-based boosting models could suffer from
overfitting; however, the XGBoost method provides several parameters,
such as maximum tree depth, minimum leaf weight, and minimum split
gain, which helps to avoid overfitting. Additionally, it adds
randomness to the model during the training phase, making it more
robust to noise. Following are the list of parameters set during cross
validation on training, max_depth-the maximum depth of the tree between
3 and 7, gamma-represents the minimum loss reduction required to make a
further partition on a leaf node of the tree and it was set between 0
and 0.2, eta-the step size of each boosting step was set using random
uniform distribution between 0.01 and 0.3, the min_child_weight-was set
to 30 as large number is usually conservative, subsample-the subsample
ratio will randomly sample the training data prior to growing trees
which will avoid overfitting was set between 0.5 and 0.8 using random
uniform distribution, colsample_bytree-was also set between 0.5 and
0.8, cv.nrounds-200 represents the maximum number of rounds for cross
validation, and the cv.nfold = 5 shows the 5-fold cross validation,
early_stopping_rounds = 100 which helps the model to stop training
further if the best performance was achieved, eval_metric=logloss was
used to minimize the loss, Binary=logistic was used for performing
logistic regression for binary classification and output the
probabilities, nthreads = 30, for parallelization of all the tasks. In
each inner loop, we iterated ten times over the list of above-described
parameters to identify the best parameters and used them to train the
XGBoost model and test the predictions through the test dataset.
Lastly, this procedure was repeated 100 times in the outer loop, and
the average performance metrics were calculated.
Leave one out cross-validation strategy (LOOCV)
We used a special case of k-fold cross-validation called LOOCV where
for each sample in the dataset, removing that sample (testing),
training was performed on all the remaining samples. We used the LOOCV
instead of previous criteria of training (70%) and testing (30%), where
random sampling without replacement was considered in the outer loop
for 100 iterations. The same parameters previously mentioned in HPO
pipeline were used here for optimizing the model during training. The
LOOCV helps to classify disease vs. normal group at subject level where
the subject being tested is classified as either disease or normal.
Evaluating performance
We measure the performance of XGBoost models using averages of
accuracy, sensitivity, and specificity obtained across 100 iterations.
The equations for these metrics with respect to true positive (TP),
true negative (TN), false negative (FN), and false positive (FP) are as
follows:
The classification accuracy of the model is defined as the ratio of
correctly predicted instances (TP + TN) to the total number of
instances in the dataset (TP + TN + FP + FN)
[MATH:
Accurac
y=TP+TNT<
mi>P+TN+FP+FN
:MATH]
Sensitivity measures the model’s ability to correctly identify positive
instances (TP) out of all the instances that are actually positive
(TP + FN).
[MATH:
Sensiti
vity=TPTP+FN
:MATH]
And finally, Specificity shows the model’s ability to correctly
identify negative instances (TN) out of all the instances that are
actually negative (TN + FP).
[MATH:
Specifi
city=TNTN+FP
:MATH]
Gene selection and pathway enrichment analysis
We obtained the ranked lists across 100 repetitions of train-test
splits (Methods, section- Feature Importance score) and followed two
strategies with different criteria to obtain list of ranked gene lists:
Ranked gene selection
We aggregated the ranked lists across 100 repetitions by preserving the
ranks by applying the robust rank aggregation (RRA)[267]^24 method to
obtain the final ranked list of genes in a given comparison. RRA is a
statistical technique used to combine rankings from multiple sources
into a single, aggregated ranking that is robust to noise,
inconsistencies, and outliers in the individual rankings. The RRA
method assigns scores in terms of P-values for each gene to determine
the significance level. We used genes with P-value <0.05 for the final
ranked list of genes for all comparisons ([268]Table S3. Ranked
significant gene lists obtained through RRA approach (p < 0.05) in
unannotated (all cells) and annotated cells (acinar, alpha, beta,
delta, ductal, immune, endothelial, and stellates) for T1D vs. CTL
pairwise comparison related to Figure 3, [269]Table S4. Ranked
significant gene obtained through RRA approach (p < 0.05) in
unannotated (all cells) and annotated cells (acinar, alpha, beta,
delta, ductal, immune, endothelial, and stellates) for T1D vs. AAb+
pairwise comparison, related to Figure 3, [270]Table S5. Ranked
significant gene lists obtained through RRA approach (p < 0.05) in
unannotated (all cells) and annotated cells (acinar, alpha, beta,
delta, ductal, immune, endothelial, and stellates) for AAb+ vs. CTL
pairwise comparison, related to Figure 3, [271]Table S18. GO-BP and
KEGG pathways for unannotated (all cells) and annotated cells (acinar,
alpha, beta, delta, ductal, immune, endothelial, and stellates) using
clusterProfiler in T1D vs. CTL pairwise comparison based on ranked gene
lists based on RRA (p < 0.05) from Table S3, related to Figure 3,
[272]Table S19. GO-BP and KEGG pathways for unannotated (all cells) and
annotated cells (acinar, alpha, beta, delta, ductal, immune,
endothelial, and stellates) using clusterProfiler in T1D vs. AAb+
pairwise comparison based on ranked gene lists based on RRA (p < 0.05)
from Table S4, related to Figure 3, [273]Table S20. GO-BP and KEGG
pathways for unannotated (all cells) and annotated cells (acinar,
alpha, beta, delta, ductal, immune, endothelial, and stellates) using
clusterProfiler in AAb+ vs. CTL pairwise comparison based on ranked
gene lists based on RRA (p < 0.05) from Table S5, related to Figure 3,
[274]Table S21. Shared list of ranked genes called RRA_combined
obtained by combining gene lists from Table S3 using RRA approach in
T1D vs. CTL, related to Figure 3, [275]Table S21. Shared list of ranked
genes called RRA_combined obtained by combining gene lists from
Table S3 using RRA approach in T1D vs. CTL, related to Figure 3,
[276]Table S22. Shared list of ranked genes called RRA_combined
obtained by combining gene lists from Table S4 using RRA approach in
T1D vs. AAb+, related to Figure 3, [277]Table S22. Shared list of
ranked genes called RRA_combined obtained by combining gene lists from
Table S4 using RRA approach in T1D vs. AAb+, related to Figure 3,
[278]Table S23. Shared list of ranked genes called RRA_combined
obtained by combining gene lists from Table S5 using RRA approach in
AAb+ vs. CTL, related to Figure 3, [279]Table S23. Shared list of
ranked genes called RRA_combined obtained by combining gene lists from
Table S5 using RRA approach in AAb+ vs. CTL, related to Figure 3).
Unranked gene selection
We aggregated the ranked lists across 100 repetitions by ignoring the
ranks and focusing on the number of times a gene was selected across
100 repetitions. For example, in beta cells of T1D vs. CTL classifier
results, if gene INS is selected 80 times among 100 repetitions, then a
selection frequency score of 80 was assigned to INS. The final ordered
list of genes with score as gene selection frequency among 100
repetitions for all comparisons ([280]Table S6. Unranked significant
gene lists with gene selection frequency in unannotated (all cells) and
annotated cells (acinar, alpha, beta, delta, ductal, immune,
endothelial, and stellates) for T1D vs. CTL pairwise comparison,
related to Figure 3, [281]Table S7. Unranked significant gene lists
with selection frequency in unannotated (all cells) and annotated cells
(acinar, alpha, beta, delta, ductal, immune, endothelial, and
stellates) for T1D vs. AAb+ pairwise comparison, related to Figure 3,
[282]Table S8. Unranked significant gene lists with selection frequency
in unannotated (all cells) and annotated cells (acinar, alpha, beta,
delta, ductal, immune, endothelial, and stellates) for AAb+ vs. CTL
pairwise comparison, related to Figure 3, [283]Table S15. GO (GO-BP,
GO-MF, GO-CC) and KEGG pathways for unannotated (all cells) and
annotated cells (acinar, alpha, beta, delta, ductal, immune,
endothelial, and stellates) using DAVID in T1D vs. CTL pairwise
comparison based on unranked gene lists based on gene selection
frequency >50 from Table S6, related to Figure 3, [284]Table S16. GO
(GO-BP, GO-MF, GO-CC) and KEGG pathways for unannotated (all cells) and
annotated cells (acinar, alpha, beta, delta, ductal, immune,
endothelial, and stellates) using DAVID in T1D vs. AAb+ pairwise
comparison based on unranked gene lists based on gene selection
frequency >50 from Table S7, related to Figure 3, [285]Table S24. GO
(GO-BP, GO-MF, GO-CC) and KEGG pathways for unannotated (all cells) and
annotated cells (acinar, alpha, beta, delta, ductal, immune,
endothelial, and stellates) using DAVID in AAb+ vs. CTL pairwise
comparison based on unranked gene lists based on gene selection
frequency >50 from Table S8, related to Figure 3).
For the ranked list of genes, we performed pathway analysis using the
ClusterProfiler[286]^78 tool. The clusters were created for the gene
ontology (GO) biological process (BP) and Kyoto Encyclopedia of Genes
and Genomes (KEGG) pathways based on the ranked list of genes across
each comparison. Our analysis determined the GO biological processes BP
and KEGG pathways with an adjusted P-value <0.05 as statistically
significant ([287]Table S18. GO-BP and KEGG pathways for unannotated
(all cells) and annotated cells (acinar, alpha, beta, delta, ductal,
immune, endothelial, and stellates) using clusterProfiler in T1D vs.
CTL pairwise comparison based on ranked gene lists based on RRA
(p < 0.05) from Table S3, related to Figure 3, [288]Table S19. GO-BP
and KEGG pathways for unannotated (all cells) and annotated cells
(acinar, alpha, beta, delta, ductal, immune, endothelial, and
stellates) using clusterProfiler in T1D vs. AAb+ pairwise comparison
based on ranked gene lists based on RRA (p < 0.05) from Table S4,
related to Figure 3, [289]Table S20. GO-BP and KEGG pathways for
unannotated (all cells) and annotated cells (acinar, alpha, beta,
delta, ductal, immune, endothelial, and stellates) using
clusterProfiler in AAb+ vs. CTL pairwise comparison based on ranked
gene lists based on RRA (p < 0.05) from Table S5, related to Figure 3).
Next, we created a shared list of ranked genes using the RRA method and
performed GO and KEGG pathway analysis using clusterProfiler[290]^78
and Metascape,[291]^79 based on the ranked gene lists obtained for
different cell types. The ranked gene lists across different cell types
in each comparison ([292]Table S3. Ranked significant gene lists
obtained through RRA approach (p < 0.05) in unannotated (all cells) and
annotated cells (acinar, alpha, beta, delta, ductal, immune,
endothelial, and stellates) for T1D vs. CTL pairwise comparison related
to Figure 3, [293]Table S4. Ranked significant gene obtained through
RRA approach (p < 0.05) in unannotated (all cells) and annotated cells
(acinar, alpha, beta, delta, ductal, immune, endothelial, and
stellates) for T1D vs. AAb+ pairwise comparison, related to Figure 3,
[294]Table S5. Ranked significant gene lists obtained through RRA
approach (p < 0.05) in unannotated (all cells) and annotated cells
(acinar, alpha, beta, delta, ductal, immune, endothelial, and
stellates) for AAb+ vs. CTL pairwise comparison, related to Figure 3,
[295]Table S18. GO-BP and KEGG pathways for unannotated (all cells) and
annotated cells (acinar, alpha, beta, delta, ductal, immune,
endothelial, and stellates) using clusterProfiler in T1D vs. CTL
pairwise comparison based on ranked gene lists based on RRA (p < 0.05)
from Table S3, related to Figure 3, [296]Table S19. GO-BP and KEGG
pathways for unannotated (all cells) and annotated cells (acinar,
alpha, beta, delta, ductal, immune, endothelial, and stellates) using
clusterProfiler in T1D vs. AAb+ pairwise comparison based on ranked
gene lists based on RRA (p < 0.05) from Table S4, related to Figure 3,
[297]Table S20. GO-BP and KEGG pathways for unannotated (all cells) and
annotated cells (acinar, alpha, beta, delta, ductal, immune,
endothelial, and stellates) using clusterProfiler in AAb+ vs. CTL
pairwise comparison based on ranked gene lists based on RRA (p < 0.05)
from Table S5, related to Figure 3, [298]Table S21. Shared list of
ranked genes called RRA_combined obtained by combining gene lists from
Table S3 using RRA approach in T1D vs. CTL, related to Figure 3,
[299]Table S21. Shared list of ranked genes called RRA_combined
obtained by combining gene lists from Table S3 using RRA approach in
T1D vs. CTL, related to Figure 3, [300]Table S22. Shared list of ranked
genes called RRA_combined obtained by combining gene lists from
Table S4 using RRA approach in T1D vs. AAb+, related to Figure 3,
[301]Table S22. Shared list of ranked genes called RRA_combined
obtained by combining gene lists from Table S4 using RRA approach in
T1D vs. AAb+, related to Figure 3, [302]Table S23. Shared list of
ranked genes called RRA_combined obtained by combining gene lists from
Table S5 using RRA approach in AAb+ vs. CTL, related to Figure 3,
[303]Table S23. Shared list of ranked genes called RRA_combined
obtained by combining gene lists from Table S5 using RRA approach in
AAb+ vs. CTL, related to Figure 3) were given as input to the RRA
method. The obtained shared list of ranked genes was called
“RRA_combined” and reported along with the corresponding GO and KEGG
pathways ([304]Table S21. Shared list of ranked genes called
RRA_combined obtained by combining gene lists from Table S3 using RRA
approach in T1D vs. CTL, related to Figure 3, [305]Table S21. Shared
list of ranked genes called RRA_combined obtained by combining gene
lists from Table S3 using RRA approach in T1D vs. CTL, related to
Figure 3, [306]Table S22. Shared list of ranked genes called
RRA_combined obtained by combining gene lists from Table S4 using RRA
approach in T1D vs. AAb+, related to Figure 3, [307]Table S22. Shared
list of ranked genes called RRA_combined obtained by combining gene
lists from Table S4 using RRA approach in T1D vs. AAb+, related to
Figure 3, [308]Table S23. Shared list of ranked genes called
RRA_combined obtained by combining gene lists from Table S5 using RRA
approach in AAb+ vs. CTL, related to Figure 3, [309]Table S23. Shared
list of ranked genes called RRA_combined obtained by combining gene
lists from Table S5 using RRA approach in AAb+ vs. CTL, related to
Figure 3).
For the unranked list of genes, the publicly available databases such
as the ‘database for annotation, visualization, and integrated
discovery’ (DAVID, version 6.8) bioinformatics web server[310]^80 was
used to identify various biological pathways of significant genes
through a set of functional annotation tools. We first filtered high
confidence genes by selection genes with a selection frequency of
greater than 50 and used the GO[311]^81^,[312]^82 and the
KEGG[313]^83^,[314]^84^,[315]^85 databases for pathway enrichment
analysis in the DAVID database. The GO classifies the gene
functionalities into three categories: biological processes (BP),
cellular component (CC), and molecular function (MF), and the KEGG
database provides an overview of high-level gene functions and
biological signaling pathways. Our analysis determined the GO and KEGG
terms with false discovery rate (FDR) < 0.05 as statistically
significant ([316]Tables S15, [317]S16, and [318]S24).
Protein-protein interaction networks and gene modules selection
The Search Tool for Retrieval of Interacting proteins database
(STRINGdb v11)[319]^86 was used to identify the various protein-protein
interaction networks (PPI) of significant gene lists based on the
medium confidence score of 0.7. The loaded PPI network from
STRINGdb[320]^86 was analyzed using the open-source
Cytoscape[321]^87^,[322]^88 tool. Cytoscape is a bioinformatics
software used for visualizing and integrating highly complex PPI
networks through several supported plugins. The StringApp[323]^87
within Cytoscape is used to load the raw PPI network, and the network
analyzer[324]^89 plugin is used to measure the degree of interaction
between nodes and display the up/down-regulated genes. Finally, we
applied the molecular complex detection (MCODE)[325]^25
clustering-based algorithm for further splitting the network into
modules/clusters that helped identify the densely connected regions. We
used the default parameters (degree cutoff = 2, node score cut-off =
0.2, k-core = 2, and max.depth = 100) to filter and identify key
clusters in the network. We selected the top modules from each analysis
group to show the degree of interactions.
Differential expression analysis
The differential expression (DE) analysis was performed on the
LogNormalized data using Seurat’s ‘FindMarkers’ function[326]^58 where
the DE test uses non-parametric Wilcoxon rank-sum test as default to
test for the DE genes between two groups of cells within same cell type
across different conditions. Here, each cell is treated as an
independent replicate. We also performed pseudobulk analysis where the
gene counts of all cells within donors were aggregated using Seurat’s
‘AggregateExpression’ and the sample level DE analysis was performed
using DESeq2.[327]^90 In both individual single-cell and pseudobulk
strategies, the significant DE genes were filtered based on threshold
of adjusted p-value <0.05.
Quantification and statistical analysis
The differentially expressed genes were determined based on adjusted
p-value <0.05 in both individual cells and Pseudobulk strategies using
non-parametric Wilcoxon rank-sum test and DESeq2 approaches
respectively.
Acknowledgments