Abstract Introduction Ageing in the human bone marrow is associated with immune function decline that results in the elderly being vulnerable to illnesses. A comprehensive healthy bone marrow consensus atlas can serve as a reference to study the immunological changes associated with ageing, and to identify and study abnormal cell states. Methods We collected publicly available single cell transcriptomic data of 145 healthy samples encompassing a wide spectrum of ages ranging from 2 to 84 years old to construct our human bone marrow atlas. The final atlas has 673,750 cells and 54 annotated cell types. Results We first characterised the changes in cell population sizes with respect to age and the corresponding changes in gene expression and pathways. Overall, we found significant age-associated changes in the lymphoid lineage cells. The naïve CD8^+ T cell population showed significant shrinkage with ageing while the effector/memory CD4^+ T cells increased in proportion. We also found an age-correlated decline in the common lymphoid progenitor population, in line with the commonly observed myeloid skew in haematopoiesis among the elderly. We then employed our cell type-specific ageing gene signatures to develop a machine learning model that predicts the biological age of bone marrow samples, which we then applied to healthy individuals and those with blood diseases. Finally, we demonstrated how to identify abnormal cell states by mapping disease samples onto the atlas. We accurately identified abnormal plasma cells and erythroblasts in multiple myeloma samples, and abnormal cells in acute myeloid leukaemia samples. Discussion The bone marrow is the site of haematopoiesis, a highly important bodily process. We believe that our healthy bone marrow atlas is a valuable reference for studying bone marrow processes and bone marrow-related diseases. It can be mined for novel discoveries, as well as serve as a reference scaffold for mapping samples to identify and investigate abnormal cells. Keywords: bone marrow, single-cell atlas, ageing, inflammageing, data integration Introduction The human immune system plays a crucial role in fending off challenges from viruses and microbes, as well as malignancies. As an individual ages, the immune system ages alongside, characterised by immune cell population size changes, functional capability alterations, and mutation accumulation ([29]1). These degradations of the immune system in turn increase the risks of infections and cancers ([30]2, [31]3). Chronic, low-grade inflammation, or inflammageing, also develops with age and is a risk factor for various diseases including diabetes mellitus and cardiovascular diseases ([32]4, [33]5). As the bone marrow (BM) is the main source of new immune cells, it is important to study ageing-related changes in the bone marrow and how they contribute to the weakening of the immune system. To date, most immune ageing studies employed flow cytometry to analyse peripheral blood samples ([34]1). Flow cytometry has high throughput and is low cost compared to sequencing experiments but can only measure 20+ parameters, or 40+ parameters for destructive mass cytometry. To probe transcriptome-wide changes, sorted cells can be subjected to bulk sequencing but this limits the analysis of differences to predefined cell types. Mouse samples are also commonly used as they are the easiest to obtain. For human-based studies, the reported age-associated changes in peripheral blood cell population abundance include shrinking naïve T cell populations, increasing effector/memory and regulatory T cell populations, shrinking B cell subsets, and increasing monocyte populations ([35]1). Overall, there is also a clonal shift towards myeloid-biased hematopoietic stem cells (HSCs), which result in a skew towards circulating myeloid populations ([36]6). In the past decade, advances in single-cell technologies have lowered costs while increasing the scale of data generated. This has spurred an increasing number of studies exploiting single-cell RNA sequencing (scRNA-Seq) to profile different tissues including the bone marrow. The consequent explosion of publicly available data affords us opportunities to construct large scale cell atlases from a wide array of sample datasets. To date, several human bone marrow atlases have been constructed. Most incorporate both healthy and diseased samples to investigate the differences between them with only a handful of studies that focus on healthy bone marrow. Hay et al. sequenced over 100,000 cells from eight healthy donors, spanning 35 annotated cell types ([37]7). They also characterised the immunological differences due to gender and age. They detected minimal gender-specific differences but HSC frequency was found to reduce with age. However, the low number of healthy donors makes it difficult to establish strong and generalisable conclusions. The Human Cell Atlas also hosts a larger updated census of more than half a million immune cells from samples that include bone marrow and umbilical cord blood ([38]8). However, there currently appears to be no associated analysis published. Here we present our healthy bone marrow atlas constructed with 145 publicly available scRNA-Seq datasets from 22 studies. As a single cell resource, it can be mined to gain insights into healthy bone marrow tissue and serve as a reference onto which we can map disease samples to investigate disease pathology. We first employed it to investigate ageing-related changes in the bone marrow. We identified cell populations that proportionally change with age and the related changes in gene expression and associated pathways. We then trained an age predictor model with cell type-specific ageing gene signatures to investigate the apparent age of disease samples with respect to their chronological age. There we found acute myeloid leukaemia (AML) samples to have a lower apparent age for more aged samples while younger samples had a higher predicted age. Finally, we mapped blood cancer samples onto the atlas to identify abnormal cell types and associated transcriptomic changes. We were able to identify abnormal plasma cells and erythroblasts in multiple myeloma (MM) samples, and abnormal cells in acute myeloid leukaemia samples. Results Construction of human healthy bone marrow cell atlas and its application in identifying cells and gene signatures associated with ageing and diseases To construct our reference healthy human BM atlas, we used 145 publicly available human BM scRNA-Seq datasets with publicly available sequencing reads from 22 projects ([39] Figure 1A ). Of the 145 samples, 92 samples have age information, with age ranging from 2 to 84 years and a median of 45 years ([40] Supplementary Table 1 ). Our collected data also includes 34 foetal BM samples. From samples with gender information, the gender ratio is relatively balanced at 54 female and 45 male samples. 91 samples were sorted using a variety of strategies. All samples were sequenced using the 10x Genomics sequencing platforms. Figure 1. [41]Figure 1 [42]Open in a new tab Healthy BM atlas construction, annotation, and analysis. (A) Integration and analysis of publicly available healthy BM samples to construct the atlas. 145 healthy BM samples were used to build the final atlas with 673,750 cells in 54 cell types. Blood disease samples of acute myeloid leukaemia and multiple myeloma were projected onto the healthy BM atlas to identify disease-specific populations. Age-correlated cell type frequency changes and associated dysregulated genes and pathways were identified. Identified age-associated gene signatures were then used with machine learning to predict cell age. (B) UMAP of the integrated healthy BM atlas with cells coloured by their cell type annotation. For an interactive version of this atlas, please visit DISCO at [43]http://www.immunesinglecell.org/atlas/bone_marrow. The raw reads were previously processed as part of the DISCO database ([44]9). We employed our pipeline to remap the reads to a single reference genome, GRCh38 (Ensembl 93), for annotation consistency and to reduce potential technical effects. We then processed the resulting read counts using the Seurat package ([45]10). Standard quality control steps on UMI counts, number of detected genes, and the fraction of mitochondrial and ribosomal genes were applied to filter out low quality cells. To remove the significant batch effects present, we employed our FastIntegration tool developed for atlas-scale integration ([46]11) ([47] Supplementary Figure 1 ). This was followed by unsupervised clustering and differential gene expression analysis. We then annotated each cluster’s cell type using canonical marker genes ([48] Supplementary Figure 2 and [49]Supplementary Table 2 ). Our constructed healthy BM atlas is currently the largest among existing healthy BM atlases ([50]7, [51]12, [52]13) in both cell count and number of donor samples. It consists of 673,750 cells with 54 annotated cell types ([53] Figure 1B ). The cell types present mostly overlap with the previously constructed healthy BM atlas by Hay et al. ([54]7) but are annotated at higher resolution of cell subtypes with known markers. Our annotated cell types can be divided into five major groups: T/NK cells, B cells, monocytes and DCs, progenitor cells, and erythrocytes/megakaryocytes. Other smaller clusters of cells identified are osteoclasts, fibroblast, and mesenchymal stromal cells. For the T/NK cells, we could divide the naïve population into CD4^+ and CD8^+ subtypes, the CD8^+ T cells into GZMK and GZMB subtypes, and NK cells into CD16^+ and CD56^+ subtypes. We also identified mucosal-associated invariant T (MAIT), gamma-delta T, and regulatory T (Treg) cells. For B cells, we could identify the subtypes along its developmental path from the common lymphoid progenitors (CLP) to pro-B, pre-B, naïve B, memory B, and plasma cells. The monocyte populations found were the CD14^+ and CD16^+ subtypes, while the dendritic cells were divided into classical dendritic cells (cDCs) and plasmacytoid dendritic cells (pDCs). We compared our annotations with available annotations of the contributing samples ([55]GSE185381 ([56]14) and Census of Immune Cells ([57]8)) and we found good concordance among the major cell types ([58] Supplementary Figure 3 ). Using the unsorted samples, we computed the distributions of cell type proportions ([59] Figure 2A ). Mature T cells formed the majority of cells in the bone marrow with naïve CD4^+ T cells being the most numerous. Naive CD8^+ T cells expectedly showed a much lower average percentage with this skew being attributed to the high failure rate of CD8^+ T cells during selection in the thymus ([60]15). Other cell types, namely monocytes, B cells, and NK cells, made up significant fractions. Overall, these proportions are in line with previous studies on the cell types present in the bone marrow ([61]13). Figure 2. [62]Figure 2 [63]Open in a new tab Detailed cell type populations and annotation. (A) Percentage of cells for each cell type in unsorted bone marrow samples. (B) pDC development pathways of myeloid and lymphoid origins. (C) Treg cell population identified by canonical markers. (D) Osteoclast/macrophage cells identified by canonical markers. By integrating a large number of cells into a single atlas, there are sufficient cells to populate the secondary developmental pathways. Here our atlas illustrates the different developmental pathways of pDCs (marked by IRF7), namely myeloid and lymphoid origins ([64]16) ([65] Figure 2B ). The myeloid origin pathway (blue) traces from the DC precursor populations and is marked by the myeloid-associated LYZ. The lymphoid origin pathway (green) traced from the CLP population and is marked by JCHAIN and HIST1H4C. Within the T cell population, we could also distinguish the Treg cell subset ([66] Figure 2C ). The large cell count enabled us to increase the resolution of clustering and identify rare cell types in the bone marrow, such as osteoclasts ([67] Figure 2D ) and mesenchymal stromal cells. These cells were not annotated in previous healthy bone marrow atlases. As the integration included foetal data samples, we could identify foetal erythrocytes expressing HBG1 and HBG2 (foetal HBG^+ erythrocyte). Bone marrow atlas captures age-related changes in immune cell populations In ageing studies, blood cell population changes are typically analysed using peripheral blood samples. We first examined cell type frequency changes in the bone marrow and compared them against those reported in human blood samples. Due to large variances in cell populations across samples and studies, we only used samples from three studies, namely [68]GSE120221 with 25 samples ([69]13), [70]GSE185381 with 10 samples ([71]14), and the Census of Immune Cells dataset with 8 samples ([72]8). We selected these studies as they had large numbers of unsorted samples with age information. The remaining studies were either composed of sorted samples or had too few donor samples (<5). We tested the correlations for each annotated cell type using a linear model with the study batch as covariate. Two cell types, effector/memory CD4^+ T and CD16^+ NK cells, had statistically significant positive correlations across all samples ([73] Figure 3A , [74]Supplementary Table 3 ). The effector/memory CD4^+ T cell’s correlation was statistically significant when combining the regression output from all three studies and even within the [75]GSE120221 and Census study sets, giving us the greatest confidence in this result. Moreover, this correlation has also been found in peripheral blood by Li et al. ([76]17). Increasing proportions of CD16^+ NK cells have also been reported in the blood of elderly individuals ([77]18). Figure 3. [78]Figure 3 [79]Open in a new tab Analyses of age-correlated changes in the bone marrow. (A) Regression plots showing significant changes in cell type frequencies with respect to age for naïve CD8^+ T cell, effector/memory CD4^+ T cell, and common lymphoid progenitor in the bone marrow. (B) UpSet plot with set size depicting the number of significant age-related DEGs for each of the respective cell types and intersection size illustrating the number of age-related DEGs unique to a cell type or shared with other cell types depending on which cells have been filled in the respective columns. (C) Regression plots showing the upregulation of sample averaged RPL13A expression for multiple cell types across age. (D) Regression plots of sample averaged expression of GADD45 and ZBTB16. (E) Heatmap of age-associated dysregulated pathways in different cell types. Among the annotated cell types negatively correlated with age, the naïve CD8^+ T cell population had the largest average coefficient and was consistent across the three studies. Decline in naïve T cell populations have been reported for peripheral blood in different studies ([80]17, [81]19, [82]20) and is well recognised. For the corresponding naïve CD4^+ T cell population, we obtained a negative correlation with the [83]GSE120221 and [84]GSE185381 sets, but a positive correlation with the Census study set ([85] Supplementary Figure 4 ). The latter result is unexpected and can be explained by the small number of samples (n=8) coupled with high variability within the data that can mask the expected age-related decline. Another notable cell type negatively correlated with age is the CLP population. As progenitor cell types, changes in these populations have disproportionate impact on mature cell type populations and consequently the immune system’s functions. This decline in the proportion of lymphoid-biased progenitors and consequent myeloid bias in haematopoiesis is widely reported ([86]21). The computed correlations for all cell types are given in [87]Supplementary Table 3 . Cell type-specific ageing gene signature and pathway changes correlated with ageing We next investigated the cell type-specific genes with expression that correlate with age, using all samples with age metadata ([88] Figure 3B ). Among the annotated cell types, GMP cells had the largest number of age-associated differentially expressed genes (DEGs) with 663 genes, while IFN-activated T cells had the least with 107 genes. Most of these genes were shared among few cell types, with only 20 genes shared in 10 or more cell types. The most conserved DEG was RPL13A in 17 cell types ([89] Figure 3C ). RPL13A’s upregulation with age has also been reported for almost all tissue types of mice ([90]22). Other genes of note include IMMP2L and DIP2B which showed downregulation with age in 10 cell types. Switching off IMMP2L signalling has been shown to drive cell senescence ([91]23). Similarly, DIP2B knockout cells expressed senescence markers found in ageing cells ([92]24). We also found positive age correlations for chemokines CCL4, CCL4L2, and CCL5 in tissue-resident NK cells, CCL4L2 in CD16^+ NK cells, CCL4 in CD56^+ NK cells, and CCL3 in GZMK^+ CD8^+ T cells ([93] Supplementary Table 4 ). With ageing, the skew towards myeloid lineage haematopoiesis is well documented ([94]25). Within the MPP population, we investigated transcriptomic changes that can contribute to the decline in differentiation towards the lymphoid lineage. Here we found two notable genes, GADD45B which was upregulated with respect to age, and ZBTB16 that was downregulated ([95] Figure 3D ). GADD45B has been characterised as a myeloid differentiation gene and plays a role in the response of myeloid cells to stress stimulation ([96]26, [97]27), while the ZBTB16 gene is a negative regulator of myeloid cell development ([98]28). We followed up with pathway analysis of the age correlated genes using EnrichR ([99]29). Here we show the enriched Reactome pathways ([100] Figure 3E ). The different cell types show a wide range of dysregulated pathways. We note that the CD14^+ MHCII^low monocytes, late haemoglobin^+ erythrocytes, memory B cells, and pre-B cells similarly show dysregulated translation processes. Other B cell subtypes, namely pro-B cells and naïve B cells, show dysregulated mRNA processing pathways, and the naïve B cell subset also show dysregulated cellular stress pathways. For the interferon-activated T cells, the dysregulated pathways are primarily centred around metabolism and interferon signalling. The metabolic changes in T cells due to ageing is linked to cell senescence and reduced functionality, though the exact mechanisms are under investigation ([101]30). Age prediction of disease samples show divergence from chronological age Cancer is generally recognised as an ageing-related disease. While ageing increases the risk of cancer, it has also been reported that cancer can alter the expression trends of ageing-related genes ([102]31). We theorised that diseases including cancer can modify the gene expression of diseased cells to appear younger or older than their chronological age. Thus, we investigated how diseased cells’ age predicted by their transcriptome differ from their chronological age. We first trained a cell age predictor using the healthy atlas. The predictor was constructed using elastic net regression and the age-related DEGs identified for each cell type. For each sample, we computed the age of each cell type present, and the median predicted age was used as the overall predicted age. The feature gene set were selected via regularisation in the elastic net regression, and we employed 10-fold cross validation to check the hyperparameters ([103] Supplementary Figure 5 ). The resulting predictor was able to predict the ages of healthy individuals with a correlation coefficient of 0.92 and an error of 6.3 years ([104] Figure 4A ). We further tested our predictor using the corresponding cell types from healthy blood samples, obtaining good predictions with CD14^+ monocytes and Treg cells ([105] Figure 4B ). List of blood samples are presented in [106]Supplementary Table 5 . Figure 4. [107]Figure 4 [108]Open in a new tab Age prediction using identified age-associated DEGs with machine learning. (A) Regression plots showing the correlation between chronological age and median predicted age of the respective cell types in healthy BM samples. (B) Regression plots showing the correlation between chronological age and median predicted age of select peripheral blood cell types. (C) Heatmap depicting the differences (error) between chronological age and predicted age of diseased BM samples. We then used the age predictor to predict the age of patients diagnosed with acute myeloid leukaemia (AML) and multiple myeloma (MM). The predicted ages of the cancer samples showed the trends of older patients being predicted to have a younger age while younger patients were predicted to have an older age ([109] Figure 4C and [110]Supplementary Table 6 ). For the AML samples, the predicted ages fall in the range of 32 to 51 years old with an average of 41.8. For the MM samples, the predicted ages fall in the range of 49 to 60 years old with an average of 53. This suggests that cancer alters gene expression patterns to portray a cancer-specific apparent age that may be different across different cancer types. In the case of AML, this also affects the paediatric disease subtypes. Future investigation into this apparent cell age phenomenon will encompass other blood cancers. Blood cancer-specific cells identified by mapping disease samples onto atlas To demonstrate our healthy BM atlas’ applicability as a reference for identifying diseased cells, we mapped samples of AML and MM onto our atlas. By integrating healthy and diseased samples together, abnormal cells can be identified. We first mapped 10 MM samples from [111]GSE189460 onto our healthy atlas and performed label transfer to annotate the cells ([112] Figure 5A ). We also computed a cell type prediction confidence score based on each cell’s distance to its neighbours in the reference atlas (Methods). Lower prediction confidence scores denote mapped cells that were phenotypically different from cells found in the healthy atlas, which in this case implied diseased cells. Among the mapped cells, the lowest scores were found among the predicted plasma and erythroblast cells ([113] Figure 5B ). The predicted plasma cell types correspond to the malignant plasma cells that accumulate in the bone marrow while abnormalities in the erythroblast compartment correspond to the disrupted erythropoiesis process which gives rise to the common MM symptom of anaemia ([114]32). Figure 5. [115]Figure 5 [116]Open in a new tab Projecting diseased BM samples onto healthy reference to identify disease-specific cells. (A) Multiple myeloma datasets projected onto the healthy BM atlas. (B) Confidence scores of cell type prediction of MM samples. (C) Heatmap of DEGs between plasma cells from MM samples versus healthy samples. (D) Pathway enrichment analysis of MM plasma cell DEGs. (E) Projection of AML datasets onto the healthy BM atlas. (F) Confidence scores of cell type prediction with AML samples. We compared the malignant plasma cells to their healthy counterparts and obtained 80 DEGs (p< 0.05 and |logFC| > 0.5) ([117] Figure 5C and [118]Supplementary Table 7 ). The upregulated genes include the previously reported JUND ([119]33), FOSB, PDIA2 ([120]34), and CCL3 ([121]35). Elevated levels of CCL3 in the bone marrow is of particular interest as it has been proposed to suppress erythropoiesis and cause anaemia in MM patients ([122]36). In the list of downregulated DEGs, low CD27 expression in malignant plasma cells has been reported to be correlated with poor prognosis ([123]37) while the tumour suppressor gene WWOX has been found downregulated due to translocations or deletions ([124]38). In the enrichment analysis, dysregulated pathways include TP53 and cytokine signalling ([125] Figure 5D ). The dysregulation of cytokine signalling is an expected characteristic of malignant plasma cells due to the high levels of cytokines that also serve as chemoattractants that attracts them to the bone marrow ([126]39). We next mapped 10 AML samples from [127]GSE185381 ([128]14) onto our BM atlas ([129] Figure 5E ). We performed label transfer to label the AML sample cells and computed the prediction confidence score for each cell. Comparison of the computed scores with the malignant cell labels in the original annotation showed good correlation with the malignant cells have low scores ([130] Figure 5F ). The malignant cells’ transferred labels are primarily progenitor types and myeloid types (monocytes and dendritic cells), which have been previously reported ([131]40). This presents an approach towards identifying malignant cells as an alternative to employing mutation detection. Discussion The bone marrow is the site of haematopoiesis, a highly regulated process that must be responsive to the body's needs. Like other bodily processes, haematopoiesis is profoundly affected by ageing. Well described by multiple studies, the ageing process brings about many changes to the cell populations in the bone marrow, including shrinking naïve cell populations, accumulating memory cell populations, and myeloid bias in cellular output ([132]1). In this work, we constructed a reference healthy BM atlas using published scRNA-Seq datasets with publicly available reads. The samples’ ages spanned the lifespan of most individuals, from 2 to 84 years of age, with 33 additional foetal samples. We first investigated age-related changes in the BM. We recapitulated some of the reported cell population changes with respect to age and identified accompanying gene expression changes. Among the age-correlated genes, most were cell type-specific with only a small number being shared among the different cell types. Notably, we found the downregulation of IMMP2L and DIP2B, which have been implicated in cell senescence. We also found chemokines upregulated in different NK cell subtypes. Future work will aim to refine the ageing-associated genes and pathways, and investigate the mechanisms linking to reduced immune functions. In the BM atlas constructed by Human Cell Atlas consortium ([133]7), HSC cell frequency decreased with age. This was not replicated in our analysis. This discrepancy may be explained by HSCs representing a very small percentage of each sample (<4%), making the measured frequency prone to errors. Thus, a much larger number of samples and larger cell counts per sample are needed to reduce the error margins and verify this observation. Moreover, the HCA atlas used only eight samples from donors between the ages of 25 and 53, which limited the study’s statistical power. We also employed the cell type-specific genes that were correlated with age to train a cell age predictor. With the predictor, we found chronologically younger blood cancer samples to have an older predicted age while older blood cancer samples showed a younger predicted age. As cancer cells possess characteristics of immortality, active cell division, and higher metabolic activity, they can appear phenotypically younger. For chronologically younger samples, the genomic and even environmental changes associated with cancer may also alter gene expression to seemingly reflect an older cellular age. Higher stress levels than healthy tissue, cellular dysregulation, and activated but ineffectual DNA repair pathways can be contributing factors as well. Most disease studies construct cell atlases that combine both healthy and diseased samples to study the differences between them. However, a comprehensive healthy tissue atlas is still valuable to disease studies. As recently demonstrated by Dann et al. ([134]41), using a separate healthy reference as a reference scaffold to map disease samples and matched healthy controls can improve the identification of disease-associated cell states and reduce the number of control samples while preserving the rate of false discoveries. Here, our constructed healthy BM atlas encompassing a large number of donor samples from multiple studies and across a wide age range, is a comprehensive healthy reference that can serve as a baseline for comparative studies with diseased samples. As bone marrow samples require an invasive procedure to obtain, using our healthy atlas to reduce the needed number of healthy control samples is greatly beneficial. For this work, we demonstrated the mapping of disease samples onto our atlas to identify diseased cell states. We identified abnormal plasma cells and erythroblasts in MM samples, as well as AML-specific abnormal cells. The diseased cells can be easily identified by their low cell type prediction confidence scores. We further inspected the differentially expressed genes of the MM plasma cells and identified previously reported disease markers. We believe that our BM atlas is a valuable reference for studying healthy bone marrow processes and bone marrow-related diseases. As more BM scRNA-Seq data become available, we will continuously update and improve on our atlas currently available in the DISCO ([135]11) atlas collection: [136]http://www.immunesinglecell.org/atlas/bone_marrow. We will also pursue future studies on bone marrow diseases using our atlas as the reference scaffold. Methods Single-cell RNA-Seq data collection, integration, and annotation We retrieved healthy bone marrow sample datasets from the DISCO database ([137]9). The datasets were previously preprocessed from raw reads and mapped onto the human reference genome, GRCh38 (Ensembl 93). Except for data integration with FastIntegration, we employed the Seurat package ([138]10) for downstream data analyses. For each sample, we filtered the cells based on their unique molecular identifier (UMI) counts, the number of detected genes, and the fraction of mitochondrial and ribosomal genes. As the data were acquired in different experiments, we utilised the distribution of QC metrics from each sample to manually determine the cut-offs applied for filtering. Subsequently, samples with less than 200 cells were removed and the gene expression for each cell was normalised to the total expression using the “NormalizeData” function found in the Seurat package. We next applied our FastIntegration algorithm to integrate the retained data samples. For each sample, we first identified the top 3000 highly variable features. Thereafter, we merged the lists from all samples and selected the top 3000 most common ones. These highly variable features were used to identify the anchors between samples and the anchors were then used for the subsequent integration and batch correction steps. After integration, the batch-corrected gene expression values were standardised using the “ScaleData” Seurat function and Principal Component Analysis (PCA) was performed. The first 30 PCs were then used for Uniform Manifold Approximation and Projection (UMAP) to enable visualisation. For clustering, we built the KNN graph based on the Euclidean distance in PCA space and applied the Louvain algorithm. The Wilcoxon rank sum tests were used for identifying differentially expressed genes (DEGs) in each cluster, which were subsequently utilised for manual cell type annotation. Identification of age-related cell types We selected three projects ([139]GSE185381, [140]GSE120221, and HCA_HematopoieticImmuneCellAtlas) that had unsorted data and sample sizes larger than 5. For each cell type, we performed linear regression on the proportion of each cell type with age with the following model: [MATH: xj=β0< /mn>Agej+β1Proj ectj :MATH] where x[j] is the percentage in sample j. A Spearman correlation p value was computed for each project and an overall p value for the linear model. Age-related gene identification and enrichment analysis To identify age-correlated genes, we averaged the batch-corrected gene expression values of each cell type in each sample. We only considered cell types found in at least 20 samples and with at least 20 cells. Only the genes expressed in > 10% cells were used. Subsequently, we correlated gene expression with age using the Pearson correlation and retaining genes with p value less than 0.01 as age-related genes. For gene set enrichment analysis, we used Enrichr ([141]29) with gene sets from the KEGG, GO, Reactome, and TRUST databases. Pathways with an adjusted p value less than 0.01 were selected for visualisation. We also compared our identified age-related genes with the gene list in GenAge's database ([142]42) and genes identified in epigenetic clocks ([143]43). Age prediction Based on the identified age-related genes, we trained a model to predict the age of a sample. As the age-related genes we identified were linearly correlated with age, we applied elastic net linear regression from the glmnet package ([144]44) to build an age prediction model for each cell type. The performance of this model was evaluated using 10-fold cross validation ([145] Supplementary Figure 5 ), and the median predicted age of all cell types was taken as the predicted age for the sample. To estimate the model’s accuracy, we calculated the median absolute difference between the predicted age and chronological age. Prior to predicting the age of new cell samples, we first mapped them to the reference atlas and using only the cells that were confidently mapped for the age prediction. Mapping of diseased samples to bone marrow atlas We downloaded single-cell RNA-Seq data of acute myeloid leukaemia and multiple myeloma samples from the DISCO database. For each disease, we first integrated the data of all samples using FastIntegration. Subsequently, another round of integration was performed to integrate the diseased samples together with the healthy bone marrow atlas with the latter serving as the reference. Finally, we used the integrated data for PCA, UMAP generation, and clustering. We annotated the disease sample cells based on the most prevalent cell type among its 30 nearest neighbours in the reference atlas. We then compute a confidence score by taking the inverse of the distance between each cell and its 30 nearest neighbours, and then normalising it to a range of 0 to 1. The confidence scores were fit into a two-component Gaussian mixture distribution, with cells in the first component being deemed as confidently assigned. Data availability statement The original contributions presented in the study are included in the article/[146] Supplementary Materials . Further inquiries can be directed to the corresponding author. The scripts used in the study can be downloaded from the Github repository: [147]https://github.com/JinmiaoChenLab/Bone_Marrow_Aging. Author contributions JC conceptualised and supervised the study. ML and NL analysed the data, annotated the cell types, developed the age predictor, and generated the figures and tables. KA, NL, JC, and ML wrote the manuscript. All authors contributed to the article and approved the submitted version. Acknowledgments