Abstract

Introduction

   Ageing in the human bone marrow is associated with immune function
   decline that results in the elderly being vulnerable to illnesses. A
   comprehensive healthy bone marrow consensus atlas can serve as a
   reference to study the immunological changes associated with ageing,
   and to identify and study abnormal cell states.

Methods

   We collected publicly available single cell transcriptomic data of 145
   healthy samples encompassing a wide spectrum of ages ranging from 2 to
   84 years old to construct our human bone marrow atlas. The final atlas
   has 673,750 cells and 54 annotated cell types.

Results

   We first characterised the changes in cell population sizes with
   respect to age and the corresponding changes in gene expression and
   pathways. Overall, we found significant age-associated changes in the
   lymphoid lineage cells. The naïve CD8^+ T cell population showed
   significant shrinkage with ageing while the effector/memory CD4^+ T
   cells increased in proportion. We also found an age-correlated decline
   in the common lymphoid progenitor population, in line with the commonly
   observed myeloid skew in haematopoiesis among the elderly. We then
   employed our cell type-specific ageing gene signatures to develop a
   machine learning model that predicts the biological age of bone marrow
   samples, which we then applied to healthy individuals and those with
   blood diseases. Finally, we demonstrated how to identify abnormal cell
   states by mapping disease samples onto the atlas. We accurately
   identified abnormal plasma cells and erythroblasts in multiple myeloma
   samples, and abnormal cells in acute myeloid leukaemia samples.

Discussion

   The bone marrow is the site of haematopoiesis, a highly important
   bodily process. We believe that our healthy bone marrow atlas is a
   valuable reference for studying bone marrow processes and bone
   marrow-related diseases. It can be mined for novel discoveries, as well
   as serve as a reference scaffold for mapping samples to identify and
   investigate abnormal cells.

   Keywords: bone marrow, single-cell atlas, ageing, inflammageing, data
   integration

Introduction

   The human immune system plays a crucial role in fending off challenges
   from viruses and microbes, as well as malignancies. As an individual
   ages, the immune system ages alongside, characterised by immune cell
   population size changes, functional capability alterations, and
   mutation accumulation ([29]1). These degradations of the immune system
   in turn increase the risks of infections and cancers ([30]2, [31]3).
   Chronic, low-grade inflammation, or inflammageing, also develops with
   age and is a risk factor for various diseases including diabetes
   mellitus and cardiovascular diseases ([32]4, [33]5). As the bone marrow
   (BM) is the main source of new immune cells, it is important to study
   ageing-related changes in the bone marrow and how they contribute to
   the weakening of the immune system.

   To date, most immune ageing studies employed flow cytometry to analyse
   peripheral blood samples ([34]1). Flow cytometry has high throughput
   and is low cost compared to sequencing experiments but can only measure
   20+ parameters, or 40+ parameters for destructive mass cytometry. To
   probe transcriptome-wide changes, sorted cells can be subjected to bulk
   sequencing but this limits the analysis of differences to predefined
   cell types. Mouse samples are also commonly used as they are the
   easiest to obtain. For human-based studies, the reported age-associated
   changes in peripheral blood cell population abundance include shrinking
   naïve T cell populations, increasing effector/memory and regulatory T
   cell populations, shrinking B cell subsets, and increasing monocyte
   populations ([35]1). Overall, there is also a clonal shift towards
   myeloid-biased hematopoietic stem cells (HSCs), which result in a skew
   towards circulating myeloid populations ([36]6).

   In the past decade, advances in single-cell technologies have lowered
   costs while increasing the scale of data generated. This has spurred an
   increasing number of studies exploiting single-cell RNA sequencing
   (scRNA-Seq) to profile different tissues including the bone marrow. The
   consequent explosion of publicly available data affords us
   opportunities to construct large scale cell atlases from a wide array
   of sample datasets. To date, several human bone marrow atlases have
   been constructed. Most incorporate both healthy and diseased samples to
   investigate the differences between them with only a handful of studies
   that focus on healthy bone marrow. Hay et al. sequenced over 100,000
   cells from eight healthy donors, spanning 35 annotated cell types
   ([37]7). They also characterised the immunological differences due to
   gender and age. They detected minimal gender-specific differences but
   HSC frequency was found to reduce with age. However, the low number of
   healthy donors makes it difficult to establish strong and generalisable
   conclusions. The Human Cell Atlas also hosts a larger updated census of
   more than half a million immune cells from samples that include bone
   marrow and umbilical cord blood ([38]8). However, there currently
   appears to be no associated analysis published.

   Here we present our healthy bone marrow atlas constructed with 145
   publicly available scRNA-Seq datasets from 22 studies. As a single cell
   resource, it can be mined to gain insights into healthy bone marrow
   tissue and serve as a reference onto which we can map disease samples
   to investigate disease pathology. We first employed it to investigate
   ageing-related changes in the bone marrow. We identified cell
   populations that proportionally change with age and the related changes
   in gene expression and associated pathways. We then trained an age
   predictor model with cell type-specific ageing gene signatures to
   investigate the apparent age of disease samples with respect to their
   chronological age. There we found acute myeloid leukaemia (AML) samples
   to have a lower apparent age for more aged samples while younger
   samples had a higher predicted age. Finally, we mapped blood cancer
   samples onto the atlas to identify abnormal cell types and associated
   transcriptomic changes. We were able to identify abnormal plasma cells
   and erythroblasts in multiple myeloma (MM) samples, and abnormal cells
   in acute myeloid leukaemia samples.

Results

Construction of human healthy bone marrow cell atlas and its application in
identifying cells and gene signatures associated with ageing and diseases

   To construct our reference healthy human BM atlas, we used 145 publicly
   available human BM scRNA-Seq datasets with publicly available
   sequencing reads from 22 projects ([39] Figure 1A ). Of the 145
   samples, 92 samples have age information, with age ranging from 2 to 84
   years and a median of 45 years ([40] Supplementary Table 1 ). Our
   collected data also includes 34 foetal BM samples. From samples with
   gender information, the gender ratio is relatively balanced at 54
   female and 45 male samples. 91 samples were sorted using a variety of
   strategies. All samples were sequenced using the 10x Genomics
   sequencing platforms.

Figure 1.

   [41]Figure 1
   [42]Open in a new tab

   Healthy BM atlas construction, annotation, and analysis. (A)
   Integration and analysis of publicly available healthy BM samples to
   construct the atlas. 145 healthy BM samples were used to build the
   final atlas with 673,750 cells in 54 cell types. Blood disease samples
   of acute myeloid leukaemia and multiple myeloma were projected onto the
   healthy BM atlas to identify disease-specific populations.
   Age-correlated cell type frequency changes and associated dysregulated
   genes and pathways were identified. Identified age-associated gene
   signatures were then used with machine learning to predict cell age.
   (B) UMAP of the integrated healthy BM atlas with cells coloured by
   their cell type annotation. For an interactive version of this atlas,
   please visit DISCO at
   [43]http://www.immunesinglecell.org/atlas/bone_marrow.

   The raw reads were previously processed as part of the DISCO database
   ([44]9). We employed our pipeline to remap the reads to a single
   reference genome, GRCh38 (Ensembl 93), for annotation consistency and
   to reduce potential technical effects. We then processed the resulting
   read counts using the Seurat package ([45]10). Standard quality control
   steps on UMI counts, number of detected genes, and the fraction of
   mitochondrial and ribosomal genes were applied to filter out low
   quality cells. To remove the significant batch effects present, we
   employed our FastIntegration tool developed for atlas-scale integration
   ([46]11) ([47] Supplementary Figure 1 ). This was followed by
   unsupervised clustering and differential gene expression analysis. We
   then annotated each cluster’s cell type using canonical marker genes
   ([48] Supplementary Figure 2 and [49]Supplementary Table 2 ).

   Our constructed healthy BM atlas is currently the largest among
   existing healthy BM atlases ([50]7, [51]12, [52]13) in both cell count
   and number of donor samples. It consists of 673,750 cells with 54
   annotated cell types ([53] Figure 1B ). The cell types present mostly
   overlap with the previously constructed healthy BM atlas by Hay et al.
   ([54]7) but are annotated at higher resolution of cell subtypes with
   known markers. Our annotated cell types can be divided into five major
   groups: T/NK cells, B cells, monocytes and DCs, progenitor cells, and
   erythrocytes/megakaryocytes. Other smaller clusters of cells identified
   are osteoclasts, fibroblast, and mesenchymal stromal cells. For the
   T/NK cells, we could divide the naïve population into CD4^+ and CD8^+
   subtypes, the CD8^+ T cells into GZMK and GZMB subtypes, and NK cells
   into CD16^+ and CD56^+ subtypes. We also identified mucosal-associated
   invariant T (MAIT), gamma-delta T, and regulatory T (Treg) cells. For B
   cells, we could identify the subtypes along its developmental path from
   the common lymphoid progenitors (CLP) to pro-B, pre-B, naïve B, memory
   B, and plasma cells. The monocyte populations found were the CD14^+ and
   CD16^+ subtypes, while the dendritic cells were divided into classical
   dendritic cells (cDCs) and plasmacytoid dendritic cells (pDCs). We
   compared our annotations with available annotations of the contributing
   samples ([55]GSE185381 ([56]14) and Census of Immune Cells ([57]8)) and
   we found good concordance among the major cell types ([58]
   Supplementary Figure 3 ).

   Using the unsorted samples, we computed the distributions of cell type
   proportions ([59] Figure 2A ). Mature T cells formed the majority of
   cells in the bone marrow with naïve CD4^+ T cells being the most
   numerous. Naive CD8^+ T cells expectedly showed a much lower average
   percentage with this skew being attributed to the high failure rate of
   CD8^+ T cells during selection in the thymus ([60]15). Other cell
   types, namely monocytes, B cells, and NK cells, made up significant
   fractions. Overall, these proportions are in line with previous studies
   on the cell types present in the bone marrow ([61]13).

Figure 2.

   [62]Figure 2
   [63]Open in a new tab

   Detailed cell type populations and annotation. (A) Percentage of cells
   for each cell type in unsorted bone marrow samples. (B) pDC development
   pathways of myeloid and lymphoid origins. (C) Treg cell population
   identified by canonical markers. (D) Osteoclast/macrophage cells
   identified by canonical markers.

   By integrating a large number of cells into a single atlas, there are
   sufficient cells to populate the secondary developmental pathways. Here
   our atlas illustrates the different developmental pathways of pDCs
   (marked by IRF7), namely myeloid and lymphoid origins ([64]16) ([65]
   Figure 2B ). The myeloid origin pathway (blue) traces from the DC
   precursor populations and is marked by the myeloid-associated LYZ. The
   lymphoid origin pathway (green) traced from the CLP population and is
   marked by JCHAIN and HIST1H4C. Within the T cell population, we could
   also distinguish the Treg cell subset ([66] Figure 2C ). The large cell
   count enabled us to increase the resolution of clustering and identify
   rare cell types in the bone marrow, such as osteoclasts ([67] Figure 2D
   ) and mesenchymal stromal cells. These cells were not annotated in
   previous healthy bone marrow atlases. As the integration included
   foetal data samples, we could identify foetal erythrocytes expressing
   HBG1 and HBG2 (foetal HBG^+ erythrocyte).

Bone marrow atlas captures age-related changes in immune cell populations

   In ageing studies, blood cell population changes are typically analysed
   using peripheral blood samples. We first examined cell type frequency
   changes in the bone marrow and compared them against those reported in
   human blood samples. Due to large variances in cell populations across
   samples and studies, we only used samples from three studies, namely
   [68]GSE120221 with 25 samples ([69]13), [70]GSE185381 with 10 samples
   ([71]14), and the Census of Immune Cells dataset with 8 samples
   ([72]8). We selected these studies as they had large numbers of
   unsorted samples with age information. The remaining studies were
   either composed of sorted samples or had too few donor samples (<5).

   We tested the correlations for each annotated cell type using a linear
   model with the study batch as covariate. Two cell types,
   effector/memory CD4^+ T and CD16^+ NK cells, had statistically
   significant positive correlations across all samples ([73] Figure 3A ,
   [74]Supplementary Table 3 ). The effector/memory CD4^+ T cell’s
   correlation was statistically significant when combining the regression
   output from all three studies and even within the [75]GSE120221 and
   Census study sets, giving us the greatest confidence in this result.
   Moreover, this correlation has also been found in peripheral blood by
   Li et al. ([76]17). Increasing proportions of CD16^+ NK cells have also
   been reported in the blood of elderly individuals ([77]18).

Figure 3.

   [78]Figure 3
   [79]Open in a new tab

   Analyses of age-correlated changes in the bone marrow. (A) Regression
   plots showing significant changes in cell type frequencies with respect
   to age for naïve CD8^+ T cell, effector/memory CD4^+ T cell, and common
   lymphoid progenitor in the bone marrow. (B) UpSet plot with set size
   depicting the number of significant age-related DEGs for each of the
   respective cell types and intersection size illustrating the number of
   age-related DEGs unique to a cell type or shared with other cell types
   depending on which cells have been filled in the respective columns.
   (C) Regression plots showing the upregulation of sample averaged RPL13A
   expression for multiple cell types across age. (D) Regression plots of
   sample averaged expression of GADD45 and ZBTB16. (E) Heatmap of
   age-associated dysregulated pathways in different cell types.

   Among the annotated cell types negatively correlated with age, the
   naïve CD8^+ T cell population had the largest average coefficient and
   was consistent across the three studies. Decline in naïve T cell
   populations have been reported for peripheral blood in different
   studies ([80]17, [81]19, [82]20) and is well recognised. For the
   corresponding naïve CD4^+ T cell population, we obtained a negative
   correlation with the [83]GSE120221 and [84]GSE185381 sets, but a
   positive correlation with the Census study set ([85] Supplementary
   Figure 4 ). The latter result is unexpected and can be explained by the
   small number of samples (n=8) coupled with high variability within the
   data that can mask the expected age-related decline. Another notable
   cell type negatively correlated with age is the CLP population. As
   progenitor cell types, changes in these populations have
   disproportionate impact on mature cell type populations and
   consequently the immune system’s functions. This decline in the
   proportion of lymphoid-biased progenitors and consequent myeloid bias
   in haematopoiesis is widely reported ([86]21). The computed
   correlations for all cell types are given in [87]Supplementary Table 3
   .

Cell type-specific ageing gene signature and pathway changes correlated with
ageing

   We next investigated the cell type-specific genes with expression that
   correlate with age, using all samples with age metadata ([88] Figure 3B
   ). Among the annotated cell types, GMP cells had the largest number of
   age-associated differentially expressed genes (DEGs) with 663 genes,
   while IFN-activated T cells had the least with 107 genes. Most of these
   genes were shared among few cell types, with only 20 genes shared in 10
   or more cell types. The most conserved DEG was RPL13A in 17 cell types
   ([89] Figure 3C ). RPL13A’s upregulation with age has also been
   reported for almost all tissue types of mice ([90]22). Other genes of
   note include IMMP2L and DIP2B which showed downregulation with age in
   10 cell types. Switching off IMMP2L signalling has been shown to drive
   cell senescence ([91]23). Similarly, DIP2B knockout cells expressed
   senescence markers found in ageing cells ([92]24). We also found
   positive age correlations for chemokines CCL4, CCL4L2, and CCL5 in
   tissue-resident NK cells, CCL4L2 in CD16^+ NK cells, CCL4 in CD56^+ NK
   cells, and CCL3 in GZMK^+ CD8^+ T cells ([93] Supplementary Table 4 ).

   With ageing, the skew towards myeloid lineage haematopoiesis is well
   documented ([94]25). Within the MPP population, we investigated
   transcriptomic changes that can contribute to the decline in
   differentiation towards the lymphoid lineage. Here we found two notable
   genes, GADD45B which was upregulated with respect to age, and ZBTB16
   that was downregulated ([95] Figure 3D ). GADD45B has been
   characterised as a myeloid differentiation gene and plays a role in the
   response of myeloid cells to stress stimulation ([96]26, [97]27), while
   the ZBTB16 gene is a negative regulator of myeloid cell development
   ([98]28).

   We followed up with pathway analysis of the age correlated genes using
   EnrichR ([99]29). Here we show the enriched Reactome pathways ([100]
   Figure 3E ). The different cell types show a wide range of dysregulated
   pathways. We note that the CD14^+ MHCII^low monocytes, late
   haemoglobin^+ erythrocytes, memory B cells, and pre-B cells similarly
   show dysregulated translation processes. Other B cell subtypes, namely
   pro-B cells and naïve B cells, show dysregulated mRNA processing
   pathways, and the naïve B cell subset also show dysregulated cellular
   stress pathways. For the interferon-activated T cells, the dysregulated
   pathways are primarily centred around metabolism and interferon
   signalling. The metabolic changes in T cells due to ageing is linked to
   cell senescence and reduced functionality, though the exact mechanisms
   are under investigation ([101]30).

Age prediction of disease samples show divergence from chronological age

   Cancer is generally recognised as an ageing-related disease. While
   ageing increases the risk of cancer, it has also been reported that
   cancer can alter the expression trends of ageing-related genes
   ([102]31). We theorised that diseases including cancer can modify the
   gene expression of diseased cells to appear younger or older than their
   chronological age. Thus, we investigated how diseased cells’ age
   predicted by their transcriptome differ from their chronological age.
   We first trained a cell age predictor using the healthy atlas. The
   predictor was constructed using elastic net regression and the
   age-related DEGs identified for each cell type. For each sample, we
   computed the age of each cell type present, and the median predicted
   age was used as the overall predicted age. The feature gene set were
   selected via regularisation in the elastic net regression, and we
   employed 10-fold cross validation to check the hyperparameters ([103]
   Supplementary Figure 5 ). The resulting predictor was able to predict
   the ages of healthy individuals with a correlation coefficient of 0.92
   and an error of 6.3 years ([104] Figure 4A ). We further tested our
   predictor using the corresponding cell types from healthy blood
   samples, obtaining good predictions with CD14^+ monocytes and Treg
   cells ([105] Figure 4B ). List of blood samples are presented in
   [106]Supplementary Table 5 .

Figure 4.

   [107]Figure 4
   [108]Open in a new tab

   Age prediction using identified age-associated DEGs with machine
   learning. (A) Regression plots showing the correlation between
   chronological age and median predicted age of the respective cell types
   in healthy BM samples. (B) Regression plots showing the correlation
   between chronological age and median predicted age of select peripheral
   blood cell types. (C) Heatmap depicting the differences (error) between
   chronological age and predicted age of diseased BM samples.

   We then used the age predictor to predict the age of patients diagnosed
   with acute myeloid leukaemia (AML) and multiple myeloma (MM). The
   predicted ages of the cancer samples showed the trends of older
   patients being predicted to have a younger age while younger patients
   were predicted to have an older age ([109] Figure 4C and
   [110]Supplementary Table 6 ). For the AML samples, the predicted ages
   fall in the range of 32 to 51 years old with an average of 41.8. For
   the MM samples, the predicted ages fall in the range of 49 to 60 years
   old with an average of 53. This suggests that cancer alters gene
   expression patterns to portray a cancer-specific apparent age that may
   be different across different cancer types. In the case of AML, this
   also affects the paediatric disease subtypes. Future investigation into
   this apparent cell age phenomenon will encompass other blood cancers.

Blood cancer-specific cells identified by mapping disease samples onto atlas

   To demonstrate our healthy BM atlas’ applicability as a reference for
   identifying diseased cells, we mapped samples of AML and MM onto our
   atlas. By integrating healthy and diseased samples together, abnormal
   cells can be identified. We first mapped 10 MM samples from
   [111]GSE189460 onto our healthy atlas and performed label transfer to
   annotate the cells ([112] Figure 5A ). We also computed a cell type
   prediction confidence score based on each cell’s distance to its
   neighbours in the reference atlas (Methods). Lower prediction
   confidence scores denote mapped cells that were phenotypically
   different from cells found in the healthy atlas, which in this case
   implied diseased cells. Among the mapped cells, the lowest scores were
   found among the predicted plasma and erythroblast cells ([113]
   Figure 5B ). The predicted plasma cell types correspond to the
   malignant plasma cells that accumulate in the bone marrow while
   abnormalities in the erythroblast compartment correspond to the
   disrupted erythropoiesis process which gives rise to the common MM
   symptom of anaemia ([114]32).

Figure 5.

   [115]Figure 5
   [116]Open in a new tab

   Projecting diseased BM samples onto healthy reference to identify
   disease-specific cells. (A) Multiple myeloma datasets projected onto
   the healthy BM atlas. (B) Confidence scores of cell type prediction of
   MM samples. (C) Heatmap of DEGs between plasma cells from MM samples
   versus healthy samples. (D) Pathway enrichment analysis of MM plasma
   cell DEGs. (E) Projection of AML datasets onto the healthy BM atlas.
   (F) Confidence scores of cell type prediction with AML samples.

   We compared the malignant plasma cells to their healthy counterparts
   and obtained 80 DEGs (p< 0.05 and |logFC| > 0.5) ([117] Figure 5C and
   [118]Supplementary Table 7 ). The upregulated genes include the
   previously reported JUND ([119]33), FOSB, PDIA2 ([120]34), and CCL3
   ([121]35). Elevated levels of CCL3 in the bone marrow is of particular
   interest as it has been proposed to suppress erythropoiesis and cause
   anaemia in MM patients ([122]36). In the list of downregulated DEGs,
   low CD27 expression in malignant plasma cells has been reported to be
   correlated with poor prognosis ([123]37) while the tumour suppressor
   gene WWOX has been found downregulated due to translocations or
   deletions ([124]38). In the enrichment analysis, dysregulated pathways
   include TP53 and cytokine signalling ([125] Figure 5D ). The
   dysregulation of cytokine signalling is an expected characteristic of
   malignant plasma cells due to the high levels of cytokines that also
   serve as chemoattractants that attracts them to the bone marrow
   ([126]39).

   We next mapped 10 AML samples from [127]GSE185381 ([128]14) onto our BM
   atlas ([129] Figure 5E ). We performed label transfer to label the AML
   sample cells and computed the prediction confidence score for each
   cell. Comparison of the computed scores with the malignant cell labels
   in the original annotation showed good correlation with the malignant
   cells have low scores ([130] Figure 5F ). The malignant cells’
   transferred labels are primarily progenitor types and myeloid types
   (monocytes and dendritic cells), which have been previously reported
   ([131]40). This presents an approach towards identifying malignant
   cells as an alternative to employing mutation detection.

Discussion

   The bone marrow is the site of haematopoiesis, a highly regulated
   process that must be responsive to the body's needs. Like other bodily
   processes, haematopoiesis is profoundly affected by ageing. Well
   described by multiple studies, the ageing process brings about many
   changes to the cell populations in the bone marrow, including shrinking
   naïve cell populations, accumulating memory cell populations, and
   myeloid bias in cellular output ([132]1). In this work, we constructed
   a reference healthy BM atlas using published scRNA-Seq datasets with
   publicly available reads. The samples’ ages spanned the lifespan of
   most individuals, from 2 to 84 years of age, with 33 additional foetal
   samples. We first investigated age-related changes in the BM. We
   recapitulated some of the reported cell population changes with respect
   to age and identified accompanying gene expression changes. Among the
   age-correlated genes, most were cell type-specific with only a small
   number being shared among the different cell types. Notably, we found
   the downregulation of IMMP2L and DIP2B, which have been implicated in
   cell senescence. We also found chemokines upregulated in different NK
   cell subtypes. Future work will aim to refine the ageing-associated
   genes and pathways, and investigate the mechanisms linking to reduced
   immune functions.

   In the BM atlas constructed by Human Cell Atlas consortium ([133]7),
   HSC cell frequency decreased with age. This was not replicated in our
   analysis. This discrepancy may be explained by HSCs representing a very
   small percentage of each sample (<4%), making the measured frequency
   prone to errors. Thus, a much larger number of samples and larger cell
   counts per sample are needed to reduce the error margins and verify
   this observation. Moreover, the HCA atlas used only eight samples from
   donors between the ages of 25 and 53, which limited the study’s
   statistical power.

   We also employed the cell type-specific genes that were correlated with
   age to train a cell age predictor. With the predictor, we found
   chronologically younger blood cancer samples to have an older predicted
   age while older blood cancer samples showed a younger predicted age. As
   cancer cells possess characteristics of immortality, active cell
   division, and higher metabolic activity, they can appear phenotypically
   younger. For chronologically younger samples, the genomic and even
   environmental changes associated with cancer may also alter gene
   expression to seemingly reflect an older cellular age. Higher stress
   levels than healthy tissue, cellular dysregulation, and activated but
   ineffectual DNA repair pathways can be contributing factors as well.

   Most disease studies construct cell atlases that combine both healthy
   and diseased samples to study the differences between them. However, a
   comprehensive healthy tissue atlas is still valuable to disease
   studies. As recently demonstrated by Dann et al. ([134]41), using a
   separate healthy reference as a reference scaffold to map disease
   samples and matched healthy controls can improve the identification of
   disease-associated cell states and reduce the number of control samples
   while preserving the rate of false discoveries. Here, our constructed
   healthy BM atlas encompassing a large number of donor samples from
   multiple studies and across a wide age range, is a comprehensive
   healthy reference that can serve as a baseline for comparative studies
   with diseased samples. As bone marrow samples require an invasive
   procedure to obtain, using our healthy atlas to reduce the needed
   number of healthy control samples is greatly beneficial. For this work,
   we demonstrated the mapping of disease samples onto our atlas to
   identify diseased cell states. We identified abnormal plasma cells and
   erythroblasts in MM samples, as well as AML-specific abnormal cells.
   The diseased cells can be easily identified by their low cell type
   prediction confidence scores. We further inspected the differentially
   expressed genes of the MM plasma cells and identified previously
   reported disease markers.

   We believe that our BM atlas is a valuable reference for studying
   healthy bone marrow processes and bone marrow-related diseases. As more
   BM scRNA-Seq data become available, we will continuously update and
   improve on our atlas currently available in the DISCO ([135]11) atlas
   collection: [136]http://www.immunesinglecell.org/atlas/bone_marrow. We
   will also pursue future studies on bone marrow diseases using our atlas
   as the reference scaffold.

Methods

Single-cell RNA-Seq data collection, integration, and annotation

   We retrieved healthy bone marrow sample datasets from the DISCO
   database ([137]9). The datasets were previously preprocessed from raw
   reads and mapped onto the human reference genome, GRCh38 (Ensembl 93).
   Except for data integration with FastIntegration, we employed the
   Seurat package ([138]10) for downstream data analyses. For each sample,
   we filtered the cells based on their unique molecular identifier (UMI)
   counts, the number of detected genes, and the fraction of mitochondrial
   and ribosomal genes. As the data were acquired in different
   experiments, we utilised the distribution of QC metrics from each
   sample to manually determine the cut-offs applied for filtering.
   Subsequently, samples with less than 200 cells were removed and the
   gene expression for each cell was normalised to the total expression
   using the “NormalizeData” function found in the Seurat package.

   We next applied our FastIntegration algorithm to integrate the retained
   data samples. For each sample, we first identified the top 3000 highly
   variable features. Thereafter, we merged the lists from all samples and
   selected the top 3000 most common ones. These highly variable features
   were used to identify the anchors between samples and the anchors were
   then used for the subsequent integration and batch correction steps.
   After integration, the batch-corrected gene expression values were
   standardised using the “ScaleData” Seurat function and Principal
   Component Analysis (PCA) was performed. The first 30 PCs were then used
   for Uniform Manifold Approximation and Projection (UMAP) to enable
   visualisation. For clustering, we built the KNN graph based on the
   Euclidean distance in PCA space and applied the Louvain algorithm. The
   Wilcoxon rank sum tests were used for identifying differentially
   expressed genes (DEGs) in each cluster, which were subsequently
   utilised for manual cell type annotation.

Identification of age-related cell types

   We selected three projects ([139]GSE185381, [140]GSE120221, and
   HCA_HematopoieticImmuneCellAtlas) that had unsorted data and sample
   sizes larger than 5. For each cell type, we performed linear regression
   on the proportion of each cell type with age with the following model:
   [MATH:
   <mrow><msub><mi>x</mi><mi>j</mi></msub><mo>=</mo><msub><mi>β</mi><mn>0<
   /mn></msub><mi>A</mi><mi>g</mi><msub><mi>e</mi><mi>j</mi></msub><mo>+</
   mo><msub><mi>β</mi><mn>1</mn></msub><mi>P</mi><mi>r</mi><mi>o</mi><mi>j
   </mi><mi>e</mi><mi>c</mi><msub><mi>t</mi><mi>j</mi></msub></mrow>
   :MATH]

   where x[j] is the percentage in sample j. A Spearman correlation p
   value was computed for each project and an overall p value for the
   linear model.

Age-related gene identification and enrichment analysis

   To identify age-correlated genes, we averaged the batch-corrected gene
   expression values of each cell type in each sample. We only considered
   cell types found in at least 20 samples and with at least 20 cells.
   Only the genes expressed in > 10% cells were used. Subsequently, we
   correlated gene expression with age using the Pearson correlation and
   retaining genes with p value less than 0.01 as age-related genes. For
   gene set enrichment analysis, we used Enrichr ([141]29) with gene sets
   from the KEGG, GO, Reactome, and TRUST databases. Pathways with an
   adjusted p value less than 0.01 were selected for visualisation. We
   also compared our identified age-related genes with the gene list in
   GenAge's database ([142]42) and genes identified in epigenetic clocks
   ([143]43).

Age prediction

   Based on the identified age-related genes, we trained a model to
   predict the age of a sample. As the age-related genes we identified
   were linearly correlated with age, we applied elastic net linear
   regression from the glmnet package ([144]44) to build an age prediction
   model for each cell type. The performance of this model was evaluated
   using 10-fold cross validation ([145] Supplementary Figure 5 ), and the
   median predicted age of all cell types was taken as the predicted age
   for the sample. To estimate the model’s accuracy, we calculated the
   median absolute difference between the predicted age and chronological
   age. Prior to predicting the age of new cell samples, we first mapped
   them to the reference atlas and using only the cells that were
   confidently mapped for the age prediction.

Mapping of diseased samples to bone marrow atlas

   We downloaded single-cell RNA-Seq data of acute myeloid leukaemia and
   multiple myeloma samples from the DISCO database. For each disease, we
   first integrated the data of all samples using FastIntegration.
   Subsequently, another round of integration was performed to integrate
   the diseased samples together with the healthy bone marrow atlas with
   the latter serving as the reference. Finally, we used the integrated
   data for PCA, UMAP generation, and clustering. We annotated the disease
   sample cells based on the most prevalent cell type among its 30 nearest
   neighbours in the reference atlas. We then compute a confidence score
   by taking the inverse of the distance between each cell and its 30
   nearest neighbours, and then normalising it to a range of 0 to 1. The
   confidence scores were fit into a two-component Gaussian mixture
   distribution, with cells in the first component being deemed as
   confidently assigned.

Data availability statement

   The original contributions presented in the study are included in the
   article/[146] Supplementary Materials . Further inquiries can be
   directed to the corresponding author. The scripts used in the study can
   be downloaded from the Github repository:
   [147]https://github.com/JinmiaoChenLab/Bone_Marrow_Aging.

Author contributions

   JC conceptualised and supervised the study. ML and NL analysed the
   data, annotated the cell types, developed the age predictor, and
   generated the figures and tables. KA, NL, JC, and ML wrote the
   manuscript. All authors contributed to the article and approved the
   submitted version.

Acknowledgments