Abstract Dysregulation of early B cell lymphopoiesis—the process guiding cellular immunity development—can lead to malignancy, making it crucial to understand its regulatory mechanisms. We generated a multiomics resource comprising paired chromatin accessibility and gene expression profiles across eight human B cell precursor populations, providing a detailed characterization of early human B cell development. Integrative analysis revealed highly cell type–specific regulatory elements and enabled the reconstruction of the gene regulatory network governing differentiation. We identified putative candidate regulons, such as ELK3, enriched in pro–B cells and potentially involved in cell cycle progression. Regulons from bulk data were projected onto single-cell data, validating their activity and refining the regulatory landscape. This resource enabled identification of active regulatory programs and transformation-associated states in B cell acute lymphoblastic leukemia. The publicly available atlas provides a valuable resource for understanding B cell development and disease, supporting future efforts to decode regulatory programs in immunity and hematologic malignancies. __________________________________________________________________ Regulatory landscape of human B cell lymphopoiesis provides insights into the pathogenesis of B cell acute lymphoblastic leukemia. INTRODUCTION B cell lymphopoiesis gives rise to a diverse repertoire of peripheral B cells, antibody-producing cells capable of mediating protection against pathogens while remaining tolerant of self-antigens. Before migration to the periphery, B cell development occurs in the bone marrow following a complex process governed by the sequential rearrangement of the immunoglobulin (Ig) heavy and light chain genes and their expression and subsequent assembly of their encoded proteins into the B cell receptor (BCR) ([72]1, [73]2). This highly ordered and controlled process is divided into several cell stages or B cell subpopulations ([74]3, [75]4). The transcriptional profile of each cell subpopulation emerges from an underlying gene regulatory network (GRN) in which a limited number of transcription factors (TFs) and cofactors regulate each other and their downstream target genes ([76]5). Several TFs orchestrating the B cell development have been described ([77]2, [78]6). Key lineage TFs, such as IKZF1 (Ikaros family zinc finger 1) and PU.1, are implicated in the initial commitment toward the lymphoid lineage ([79]7, [80]8). PAX5 (paired box 5), EBF1 (early B cell factor 1), and E2A [TCF3 (transcription factor 3)] are regulators of B cell lineage commitment ([81]9–[82]11), while IRF4 (interferon regulatory factor 4), IRF8, and FOXO1 (forkhead box O1) are generally associated with Ig rearrangement ([83]12, [84]13). Last, OCT2 [POU2F2 (POU class 2 homeobox 2)] and OBF1 (OCT-binding factor 1) are crucial TFs that drive the migration of immature B cells to the periphery ([85]14, [86]15). Most of this regulatory knowledge is derived from studies in mice, and thus, it is unclear whether the mechanisms identified follow the same regulatory programs as those of humans ([87]16). The integrated analysis of paired chromatin accessibility and gene expression profiling across various cell types offers a powerful approach to reconstructing GRNs and identifying cell type–specific regulons (TFs and their associated set of targets) ([88]17–[89]19), a crucial approach to deciphering the regulation programs in humans. Such an analysis involves predicting cis-regulatory elements (CREs), identifying potential TFs, and linking the CREs to their candidate target genes. This strategy was followed to provide initial insights into the cis-regulation of mouse hematopoiesis ([90]19), capturing the key TFs committing the B cell lineage, and in the study of mixed-phenotype acute leukemia, revealing RUNX1 (runt-related transcription factor 1) as a putative regulator ([91]20). More recent advances in multiomics single-cell approaches have allowed GRN inference, applied to study human peripheral blood mononuclear cells ([92]17) and plasma cell differentiation in the human tonsil ([93]21). Furthermore, a detailed characterization of B cell subpopulations in mice and human lymphopoiesis has been obtained using single-cell technology approaches ([94]3, [95]4, [96]22, [97]23). Despite notable progress, the understanding of the regulatory mechanisms of this process remains incomplete, with current single-cell multiomics experiments providing insufficient depth (detailed profiling) and breadth (enough individuals) to conduct detailed epigenetic profiling of the early stages of human B cell differentiation. To address this gap and to decipher the GRN of early B cell differentiation in humans, we generated comprehensive chromatin accessibility and transcriptional profiles corresponding to the stepwise progression of B cell precursors (BCPs). Our study included eight cell subpopulations across 13 healthy donors. A detailed joint analysis of paired high-throughput data [bulk RNA sequencing (RNA-seq) and assay for transposase-accessible chromatin using sequencing (ATAC-seq)] allowed the generation of a comprehensive GRN of B cell differentiation, capturing cell type specificity and dynamism of regulons, also recovering the master regulators of this process. Our resource not only recapitulated well-known regulatory elements but also revealed candidate regulators of early B cell differentiation, such as ELK3, which appears to be associated with pro–B cell cycle progression. Next, we further explored the inferred regulatory programs at a single-cell resolution, which confirmed the observed regulatory dynamics and enabled a finer-grained definition of regulon-based early B cell differentiation. Last, we leveraged this detailed characterization of normal B cell development to deconvolve B cell malignancies, revealing transformation-associated cellular states and highlighting subtype-specific regulatory programs in certain forms of B cell acute lymphoblastic leukemia (B-ALL). This detailed and extended portrait of regulatory elements has been made publicly available in a user-friendly web interface for browsing (B-rex; [98]https://translationalbio.shinyapps.io/brex/), representing a rich playground to further unravel the regulatory mechanisms of B cell differentiation in health and disease. RESULTS Profiling of early human B cell differentiation stages by RNA-seq and ATAC-seq To investigate early human B cell differentiation, we harvested bone marrow and peripheral blood from the same healthy individuals [n = 13; mean age, 21.4 (20 to 23)]. Using fluorescence-activated cell sorting (FACS), we isolated hematopoietic stem progenitor cells (HSPCs), common lymphoid progenitors (CLPs), pro–B cells, pre–B cells, and immature B cells from each bone marrow sample and transitional B cells and naïve B cells (CD5+ and CD5−) from peripheral blood specimens ([99]Fig. 1A and fig. S1A). All these populations were characterized using paired transcriptomic (RNA-seq) and chromatin accessibility (ATAC-seq) assays (fig. S2A and table S1). Fig. 1. Characterization of early B cell differentiation through chromatin accessibility and gene expression profiling. [100]Fig. 1. [101]Open in a new tab (A) Major BCP subpopulations from the human bone marrow and peripheral blood. The marker strategy followed to isolate eight B cell subpopulations from 13 healthy individuals is shown. Markers strongly expressed by the given B cell subpopulation are depicted in black, whereas a lack of expression is shown in gray. (B) PCA of gene expression (79 samples), (C) chromatin accessibility (78 samples), and (D) TFBS activity score (78 samples). Each dot represents a sample colored by cell subpopulation. (E) Heatmap representation of TFBS activity score for key B cell regulators. Samples are shown by columns (n = 78) and grouped by cell subpopulation, and TFs are shown by rows (n = 12). (F) Representation of top 10 enriched GO terms derived from the transcriptional changes over B cell transitions. The NES is shown in a red-blue color range for each term and the −log[10] of the adjusted P value as dot size. Biological functions are grouped into four main terms. (G) Heatmap representation of the gene expression profile for 58 well-known genes associated with B cell lymphopoiesis. Samples are shown by columns (n = 79; grouped by cell subpopulation), and genes are shown by rows (n = 58; grouped by biological processes). Normalized gene expression levels are represented. (H) Heatmap representation of the gene set activity score (GSVA) computed for each human B cell subpopulation signature (n = 8; in rows) in each mouse sample (n = 17; in columns). The GSVA score represented in the heatmap is scaled by column to highlight the human B cell subpopulation with higher enrichment in each mouse B cell subpopulation (in dark blue). Principal components analysis (PCA) of both chromatin accessibility and gene expression data revealed consistent and well-defined agreement of cell types across individuals, effectively recapitulating the dynamic process of B cell differentiation ([102]Fig. 1, B and C). To dissect the TF-mediated regulatory mechanisms underlying this process, we first inferred the TF activity by identifying TF binding at accessible or open chromatin regions (OCRs). On the basis of these TF binding sites, we estimated the TF activity for each sample (TFBS activity score; see Materials and Methods), enabling the identification of highly active regulators within distinct cell subpopulations ([103]Fig. 1, D and E). The resulting TF activity profile showed clear cell type–specific clustering and recapitulated known regulatory circuits associated with B cell differentiation ([104]6). Among the transcriptional programs identified, pro–B cells and pre–B cells displayed the most distinct programs, underscoring key developmental transitions such as Ig rearrangement mediated by FOXO1 and IRF4/8, with further B cell commitment driven by EBF1 and PAX5. In HSPCs, stemness-associated TFs—GATA2, TAL1, and MEIS1—were highly active, while CLPs showed an increased activity of lymphoid-specifying TFs such as TCF3 and Ikaros (IKZF1). At later stages of differentiation, TFs involved in B cell maturation, including POU2F2 and SPIB, became prominent ([105]Fig. 1E and fig. S3 for the TFBS profile of all TFs). To further explore the molecular processes captured by our dataset, we examined biological pathways enriched in transcriptional changes over B cell transitions ([106]Fig. 1F and table S2), as well as the transcriptional and chromatin accessibility profiles of well-known markers of B cell differentiation ([107]Fig. 1G and [108]https://genome.ucsc.edu/s/TransBioLab/hg38_Brex for the chromatin accessibility profile). We observed a significant enrichment in proliferation and differentiation pathways for CLPs that were also preserved in pro–B cells and pre–B cells ([109]Fig. 1F). These pathways, in the early stages of B cell differentiation, are mainly regulated by the interleukin-7 receptor signaling, highly expressed in CLPs, pro–B cells, and pre–B cells, to later switch to the antagonistic signaling by the pre–B cell receptor (pre-BCR), expressed in pre–B cells ([110]Fig. 1G) ([111]2, [112]4). For the transition from pro–B cells to pre–B cells, we observed an even higher enrichment of proliferation pathways ([113]Fig. 1F), associated with pre–B cell expansion (large pre-B or FrC′) controlled by the convergence of interleukin-7 receptor and the pre-BCR signaling ([114]24). We also identified enrichment for Ig heavy-chain and light-chain rearrangements, fundamental processes that organize B lymphocyte development. Here, we observed an enrichment of Ig-related pathways in the transition from CLPs to pro–B cells, a state when Ig rearrangement commences ([115]Fig. 1F), and observed increased gene expression for RAG1 and RAG2 in pro–B cells and pre–B cells ([116]Fig. 1G). The successful Ig rearrangements lead to BCR expression in immature B cells and their migration to peripheral blood, which is associated with a reduction in proliferation and an increase in migration pathways identified during the transition from pre–B cells to immature B cells ([117]Fig. 1F). With a wealth of knowledge on B cell differentiation derived from studies described in mice, we next examined the conservation and translatability of our human B cell transcriptional signatures (fig. S4) to those equivalent in mice (fig. S5) ([118]19). To achieve this, we performed gene set variational analysis (GSVA) to calculate gene set activity scores for each human B cell subpopulation signature across individual mouse samples. These unitless scores, where higher values indicate greater gene set activity, enabled robust cross-species transcriptional comparisons. We observed a general concordance between species ([119]Fig. 1H), further supporting the evolutionary conservation of transcriptional programs across human and mouse B cell differentiation and development. In summary, this comprehensive view of human B cell differentiation demonstrated well-preserved programs across individuals, aligning closely with those described in mice and highlighting the highly organized processes subjacent to functional B cell generation. Defining the GRN and TF regulons driving early B cell differentiation We next combined chromatin accessibility and gene expression data to derive the underlying GRNs driving early B cell differentiation in healthy humans. We defined a computational workflow to decipher the GRN from the RNA-seq and ATAC-seq coordination (fig. S2B). Briefly, we first inferred the CREs (62,559 OCR-gene pairs) and trans-regulatory elements (2,299,328 TF-OCR pairs derived from a footprinting analysis). Next, we refined the resulting TF-OCR-gene links (1,823,177 links) to focus on the regulatory programs driving B cell differentiation by (i) retaining only those interactions involving differentially expressed genes (DEGs), promoter OCRs, and differentially accessible enhancer OCRs across cell state transitions—under the premise that dynamic regulatory elements are central to lineage specification—and (ii) excluding interactions in which the TF was not differentially expressed (n = 469,073), correlations between TF gene expression and target genes were not statistically significant (P > 0.05; n = 186,837), or logical inconsistencies in correlation directionality were observed (n = 117,846). This process resulted in a high-confidence GRN with 377,268 TF-OCR-gene interactions, involving 169 TFs, 7310 target genes, and 16,074 OCRs (table S3 and fig. S6, A to C). Further details are provided in Materials and Methods. After reconstructing the global GRN architecture, we defined regulons as sets of genes associated with a specific TF, inferred through TF-OCR-gene interactions. Each regulon represents a functional unit within the GRN, consisting of a TF and its co-regulated target genes and thereby constituting a coherent regulatory program operating within a specific cellular context. The regulon’s target genes were divided into two groups on the basis of their correlation with TF activity: positively correlated targets, likely activated by the TF, and negatively correlated targets, likely repressed by the TF. To explore the cell type specificity of these regulons, we computed the regulon specificity score (RSS) on the basis of the Jensen-Shannon divergence (JSD) for both predicted target regions and target genes ([120]17, [121]25). Intuitively, the JSD measures the difference between two probability distributions, in this context, how the accessibility or gene expression profile of a regulon’s targets in one cell type diverges from the average profile across all cell types. As a result, a high RSS indicates that the regulon’s activity profile is highly distinct and concentrated in a particular cell type, reflecting strong regulatory specificity. This analysis revealed the high cell type specificity of TF regulons for both activated and repressed target genes ([122]Fig. 2, A and B, and fig. S7A). Specifically, among the regulons with activating functions, we identified five main clusters according to the TF gene expression and RSS. Several regulons were associated with HSPCs, including TAL1, MEIS1, MECOM, HLF, and GATA2 ([123]Fig. 2A, cyan cluster; termed “HSPC regulons”). Others showed enrichment throughout differentiation up to the pre–B cell stage, such as GFI1, RUNX1, and MYB ([124]Fig. 2A, blue cluster; termed “early B cell regulons”). A smaller group characterized pro–B cells, including ELK3 and BCL6B (B cell CLL/lymphoma 6B) ([125]Fig. 2A, green cluster; termed “pro–B cell regulons”), while well-known factors such as SPI1, IRF4, TCF3, MEF2D (myocyte enhancer factor 2D), and EBF1 were mainly enriched in pre–B cells and immature B cells ([126]Fig. 2A, pink cluster; termed “pre–B cell/immature B cell regulons”). Several regulons associated with late differentiation stages, including PAX5, CEBPB, SPIB, etc. ([127]Fig. 2A, black cluster; termed “late B cell regulons”). In addition, 51 TF regulons exhibited repressive activity, including PAX5, EBF1, and TCF3, according to previous knowledge ([128]Fig. 2B) ([129]26). To further understand the biological and molecular functions of the regulons, we performed a pathway enrichment analysis on their target genes ([130]Fig. 2C and table S4). We observed a significant enrichment of pathways related to B lymphopoiesis and cell cycle across all regulons. Early-stage regulons (HSPC and early B cell) showed a significant enrichment of pathways involved in cell survival, proliferation, differentiation, and migration, with proliferation and survival pathways particularly prominent in HSPC regulons. Pre–B cell/immature B cell regulons were enriched in pathways related to cellular growth, consistent with the generation of large pre–B cells, while late B cell regulons were enriched for bioenergetic pathways. Together, our multiomics strategy enabled the determination of the GRN and associated regulons governing the entire B cell differentiation process, recovering a large set of cell type–specific regulons broadly organized into five major clusters. Fig. 2. Regulatory landscape of TF regulons orchestrating early B cell differentiation. [131]Fig. 2. [132]Open in a new tab (A and B) Heatmap/dot plot showing, for each TF regulon with an activator or repressor profile, respectively, the TF gene expression on a color scale and the RSS of target regions (OCRs) on a size scale. A high RSS score (high dot size) and high TF gene expression (color prone to red) indicate that the regulon is highly specific to the corresponding cell type. TF regulons are shown by rows, and B cell subpopulations are shown by columns. TF regulons are clustered into five main groups colored as cyan, blue, green, pink, and black (right-side colored bar). (C) Circus plot summarizing the main biological pathways enriched for each regulon. On the bottom part, TF regulons are shown. TF regulons are clustered according to groups identified in (A) (right-side colored bar). On the top part, the enriched biological pathways are shown, grouped into five blocks describing bioenergetic processes (in pink); cell growth (in purple); hematopoietic and B cell lineage and cell cycle (in yellow); differentiation, proliferation, survival, and migration (in blue); and proliferation and survival (in turquoise). (D) Multidimensional scaling representation from the Pearson correlation of csGRNs. Color code by cell subpopulation. (E) Heatmap representation of degree [left; total number of edges (connections) that a TF has associated] and betweenness (right; influence of a given TF over the flow of information in the network) centrality measures computed within each csGRN. TFs with the highest degree are shown in yellow, and those with the highest betweenness are shown in cyan. Each column corresponds to a csGRN, and each row corresponds to a TF regulon. TF regulons are clustered according to patterns identified in (A) (right-side colored bar). The top 10 TF regulons of each csGRN based on degree and betweenness are highlighted in red. Uncovering the master regulators of B cell differentiation through network topology On the basis of the defined regulons, we next performed a network topology analysis to provide insights into the master regulators. Intuitively, network topology analyses are based on the structural organization of the GRN, highlighting key regulators on the basis of their position and connectivity within the network ([133]27, [134]28). Our topological analysis was based on two measures of network centrality: (i) degree, which denotes the total number of edges (connections) that a TF has associated, and (ii) betweenness, which quantifies the influence of a given TF over the flow of information in the network (table S5; see Materials and Methods). As expected, when we explored the top 10 ranked TFs by degree, we obtained several TFs involved in basic cellular survival and proliferation [several E26 transformation–specific (ETS) and Krüppel-like factor (KLF) family ([135]29), TAF1 (TATA-box binding protein–associated factor 1) ([136]30), and E2F3 ([137]31) TFs]. Meanwhile, when we explored the top 10 ranked TFs on the basis of node betweenness, we recovered critical TFs of B cell differentiation such as EBF1, TCF3, IRF4, and BCL11A ([138]6, [139]32) and some ETS and KLF family TFs related to B cell lymphopoiesis, including ERG (ETS-related gene), KLF13, ELK3, and ELK4 ([140]33–[141]35). Aiming to identify the specific master regulators of each cell subpopulation, we derived a specific GRN for each cell subpopulation [referred to as a cell type–specific GRN (csGRN); figs. S2B and S6] by incorporating chromatin accessibility and gene expression profiles within each cell type. For example, the HSPC-specific GRN included only interactions where target genes were highly expressed and linked with accessible OCRs in HSPC samples. An initial visual inspection of the retrieved csGRNs, projected into a two-dimensional space on the basis of their pairwise distances, revealed that they also recapitulated the differentiation trajectories ([142]Fig. 2D). Next, we conducted the topological analysis for each csGRN by exploring the centrality degree and betweenness measures ([143]Fig. 2E). We observed how the csGRN topology generally mirrored the regulon specificity shown in [144]Fig. 2A, as those TFs with low expression or chromatin accessibility in a specific cell state were excluded from the corresponding csGRN. Among TFs with the highest values of degree and betweenness, we recovered the primary TFs implicated in B cell commitment, differentiation, maintenance, or Ig recombination, which include TCF3, PAX5, IRF4, EBF1, and LEF1 ([145]Fig. 2E, in red) ([146]1, [147]2). Moreover, this approach highlighted the crucial role of IKZF1, SPB1, IRF2, and FLI1 in the earliest stages of differentiation, reflected by their high degree and betweenness across CLPs, pro–B cells, and pre–B cells, not previously observed under the regulon specificity analysis ([148]Fig. 2, A and E). Focusing on pro–B cell regulons ([149]Fig. 2, A and E; green cluster), we identified ELK3, PPARA (peroxisome proliferator–activated receptor α), and BCL11A as putative key regulators in the transition from pro–B cell to pre–B cell stage. The essential role of BCL11A in lymphoid development has already been reported ([150]32), while the implications of PPARA and ELK3 in B cells remain unknown. To note, the ETS-factor ELK3 has been associated with pro–B cell expansion in mice ([151]35), defined as a regulon in large and small B cells (FrC′/D—pre–B cells) in the bursa of Fabricius of young broilers ([152]36), and recently described in human B cell differentiation ([153]37). Overall, our topological analysis of csGRNs not only aligned with the functional cell type specificity of regulons but was also essential in revealing the master regulators of B cell differentiation, successfully recovering well-known TFs and highlighting putative candidate regulators, such as ELK3, that may play a key role in B cell lymphopoiesis. Characterization of the ELK3 regulon showed a putative key role in the differentiation and proliferation regulation in pro–B cells Integrating the TF regulons and comprehensive GRN analysis, we further examined the specific regulatory activities of key human B cell TFs globally. As an example and positive control of the GRN analysis, we evaluated the EBF1-specific regulon (fig. S7, B to E). EBF1 is a well-established regulator of B cell differentiation, and among its predicted upstream regulators, we identified several known B cell TFs, including PAX5, TCF3, FOXO1, SPIB (Spi-1/PU.1–related), RUNX1, and CEBPA (fig. S7B), all previously reported to regulate EBF1. In addition, within the EBF1 downstream-regulated genes, we also confirmed previously established target TFs such as PAX5, TCF3, and FOXO1 (fig. S7C) ([154]38). To further unveil the biological functions of the EBF1 regulon, we performed a pathway enrichment analysis on its activated and repressed target genes. This approach identified a significant association of the EBF1 regulon–activated genes with B cell activation, proliferation, and differentiation pathways and the EBF1 regulon–repressed genes with metabolic and homeostatic processes (table S6). Last, we explored the pathway analysis of the EBF1 regulon for each csGRN, resolving pathways significantly enriched in the pro–B cell, pre–B cell, immature B cell, transitional B cell, and naïve B cell csGRNs (fig. S7, D and E, and table S7). Specifically, this analysis highlighted the involvement of EBF1 regulon–activated genes in regulating cellular division in pro–B cells and pre–B cells. After confirming that our analysis recapitulates the known regulatory biology of EBF1, we next assessed the role of a candidate TF, ELK3. ELK3 showed high gene expression levels and TF activity in HSPCs, CLPs, and pro–B cells ([155]Fig. 3, A and B) and emerged as a regulon within pro–B cells from our GRN analysis ([156]Fig. 2). We identified 50 TFs as potential regulators of ELK3, which included ERG, STAT5A/B (signal transducer and activator of transcription 5A/B), E2F3, ETS2, and ETV6 (ETS variant transcription factor 6) as activators and SPIB (Spi-1/PU.1–related) and POU2F2 as repressors ([157]Fig. 3C). The ELK3 regulon demonstrated 1339 candidate genes positively regulated and 1294 candidate genes negatively regulated by ELK3. Within the regulated genes, 52 were TFs, including key regulators of B cell differentiation such as PAX5, IKZF1, BCL11A, MEF2A, RUNX1, STAT5A, SPI1, and SPIB ([158]Fig. 3D). To unveil the biological functions of the ELK3 regulon, we performed a pathway enrichment analysis of the activated and repressed genes. This approach revealed a significant association of the candidate ELK3 regulon–activated genes with the cell cycle regulation, including the enrichment of several pathways related to DNA replication, unsaturated fatty acid metabolism, and MAPK (mitogen-activated protein kinase) and ERK (extracellular signal–regulated kinase) cascade (table S8). On the other hand, the candidate ELK3 regulon–repressed genes showed a significant enrichment of several B cell activation, proliferation, and differentiation pathways (table S9). When we assessed the pathway analysis of the ELK3 regulon for each csGRN-resolved pathway, there was a significant enrichment in the HSPCs and pro–B cells. Specifically, it highlighted the involvement of ELK3 regulon–activated genes in regulating metabolic and cellular processes by the enrichment of the fatty acid metabolic process, PI3K (phosphatidylinositol 3-kinase)/PKB (protein kinase B; Akt) signal transduction regulation, and MAPK cascade in both HSPCs and pro–B cells. A significant enrichment of cell cycle pathways was observed in pro–B cells ([159]Fig. 3E and table S10). The ELK3 regulon–repressed genes showed specific enrichment in pro–B cells, associated with several B cell activation, proliferation, and differentiation pathways ([160]Fig. 3F and table S10). Overall, our findings identify ELK3 as a previously underappreciated TF in B cell lymphopoiesis, which may play a crucial role in regulating the cell cycle and advancing B cell differentiation during the pro–B cell stage. Fig. 3. ELK3 regulon characterization. [161]Fig. 3. [162]Open in a new tab (A) Box plot showing the distribution of ELK3 gene expression and (B) ELK3 TFBS activity score by cell subpopulation. Significant gene expression changes between cell subpopulation transitions assessed by the Wilcoxon test are shown. Significant changes were defined as P < 0.05 (NS., no significant; *P = 0.05, **P = 0.01, ***P = 0.001). (C and D) Heatmap representation of gene expression for TF regulators of ELK3 and TF regulated by ELK3, respectively. Samples are shown by columns (grouped by cell subpopulation), and TFs are shown by rows. Normalized gene expression levels are represented. On the right side, ρ values of Spearman correlation for each TF versus ELK3 are shown. On top, a bar plot representation of the ELK3 gene expression is shown. (E) Representation of the main enriched GO terms derived from the pathway analysis of the ELK3 regulon–activated genes for each csGRN. The adjusted P value is shown in a red-blue color range, and the number of genes enriched within the term (counts) is shown by dot size. (F) Representation of the main enriched GO terms derived from the pathway analysis of the ELK3 regulon–repressed genes for each csGRN. Significant terms only appeared in the assessment of csGRN derived from pro–B cells. For the pro–B cell csGRN enriched terms, the adjusted P value (red-blue color range), gene ratio, and counts (dot size) are shown. Single-cell translation fine-tunes and validates bulk-derived regulons dynamics To gain a finer-grained understanding of the transcriptional regulation of the early B cell differentiation, we analyzed the bulk-based defined regulons at the single-cell level. To this end, we interrogated two single-cell datasets: a public dataset including healthy human bone marrow cells from hematopoietic stem progenitor cells (HSPCs) to mature B cells and plasma cells from five donors (phs002371.v7.2; [163]Fig. 4A) ([164]39) and an internally generated dataset including bone marrow CD34+ hematopoietic progenitors isolated from five healthy donors [in-house dataset; mean age, 22.4 (21 to 24)] ([165]Fig. 4B; figs. S1B and S8, A to I; and table S1). This approach allowed us to explore the specific TF expression and the activity of regulons over the entire B cell differentiation, including those cellular states not currently accessible by flow cytometry–based isolation. Fig. 4. Translation of the early B cell differentiation regulatory landscape to a single-cell resolution. [166]Fig. 4. [167]Open in a new tab (A) scRNA-seq UMAP dimensionality reduction of 10,529 human bone marrow cells, recovering the B cell lymphopoiesis (public dataset phs002371.v7.2). (B) scRNA-seq UMAP dimensionality reduction of 28,891 human CD34+ cells, drawing the early B cell lymphopoiesis (in-house dataset). (C and D) Heatmap showing, for each cell (columns; grouped by cell subpopulation), the estimated gene set activity score of each TF regulon–activated gene set by calculating the AUCell score (rows) for the public and in-house datasets, respectively. Row-scaled AUCell score values are represented. TF regulons are clustered and colored according to TF regulon groups identified in [168]Fig. 2A (left-side colored bar). (E and F) Dot plot showing, for each cell subpopulation, the average gene expression and the percentage of expressing cells for PPARA, ELK3, and NR3C1 (TF regulons identified in [169]Fig. 2 as pro–B cell–type specifics) and MYBL2, FOXM1, and E2F7 (cell cycle–related TFs) in the public and in-house datasets, respectively. The average gene expression is shown in a color range, and the percentage of expressing cells is indicated by dot size. (G to I) Box plot representation of the AUCell gene set activity score for the MYBL2, FOXM1, E2F7, ELK3, NR3C1, and PPARA TF regulon–activated gene sets by cell subpopulation for public (top) and in-house (bottom) datasets. We estimated the regulon activity per cell and per regulon by calculating the AUCell score for each TF regulon–activated gene set. Briefly, AUCell scores gene set activity in single cells by ranking expression and calculating the AUC of gene enrichment per cell. Therefore, we quantified the enrichment of the positively regulated genes of each regulon within all expressed genes for each cell. In both datasets, the resulting regulon enrichment patterns closely mirrored those identified in bulk analysis ([170]Figs. 2A and [171]4, C and D), showing a moderate-to-strong correlation in more than 79% of regulons (fig. S9). Despite the similarity, the granularity of the single-cell data uncovered an increased enrichment of the regulons grouped in the blue cluster with cycling cells according to one of the main biological pathways associated with this cluster (cellular proliferation; see [172]Fig. 2C). Within this group of TFs, we identified MYBL2, FOXM1, and E2F7 showing elevated gene expression and regulon activity in cycling cells ([173]Fig. 4, E to G). In addition, we observed the increased gene expression of ELK3 and NR3C1 in pro–B cells ([174]Fig. 4, E and F) along with a refined resolution of their regulon activity profiles ([175]Fig. 4H). The NR3C1 regulon, which was highly active in pro–B cells and pre–B cells ([176]Fig. 2A), showed the highest activity in cycling pre–B cells. In the case of the ELK3 regulon, which was highly active throughout the early stages of B cell differentiation—from hematopoietic stem cells (HSCs) to pro–B cells—and significantly decreased during the transition from pro–B cells to pre–B cells ([177]Fig. 2A), a delayed reduction in activity was observed, with the most notable drop occurring between cycling pre–B cells and pre–B cells ([178]Fig. 4H). PPARA displayed a similar pattern of regulon activity to ELK3 ([179]Figs. 2, A and E, and [180]4I); however, its gene expression was slightly detected in CLPs and pre-pro–B cells and was largely absent in the other cell subpopulations ([181]Fig. 4, E and F). In summary, the single-cell translation not only validated the cell type specificity of regulons but also provided high-resolution insights into their activity dynamics throughout B cell differentiation. Identification of relevant B cell subpopulations and regulons within B-ALL subtypes Given the comprehensive regulatory landscape of our human B cell profiling, we next turned our attention to the translatability of our results within a pathologic context. We hypothesized that the accurate characterization of the B cell differentiation would provide insights into the processes underlying the dysregulation of precursor B cell differentiation–related diseases, such as B-ALL. To investigate our hypothesis, we obtained a large public dataset including several B-ALL subtypes from both child and adult individuals (Li dataset; [182]Fig. 5A and fig. S10, A to C) ([183]40). Next, we aimed to investigate the distribution of each TF regulon activity (focused on activated genes and using GSVA) in each individual sample. Investigating the derived heatmap, we did not identify a perfect association between regulatory programs and the different genetic subtypes ([184]Fig. 5B). However, we found consistently increased activity scores in patients with B-ALL from the TCF3-PBX1 (pre–B cell leukemia homeobox 1) and MEF2D fusion subtypes within pre–B cell/immature B cell regulons (pink cluster). Also, within patients from the ETV6/RUNX1 subtype, a subtle increased activity was observed for some of the regulons classified as pro–B cell regulons (green cluster), including OTX1 (orthodenticle homeobox 1), IRX2 (iroquois homeobox 2), and AIRE (autoimmune regulator) (fig. S11). Fig. 5. Uncovering B-ALL through the normal B cell differentiation regulatory landscape. [185]Fig. 5. [186]Open in a new tab (A) Schematic representation of the B-ALL public Li dataset explored, including 469 individuals (childhood and adult) from three different cohorts. The PCA of gene expression is shown. Each dot represents a sample colored by B-ALL genetic subtype. (B) Heatmap representation of the gene set activity score (GSVA) of each TF regulon–activated gene set (rows) computed for each B-ALL sample (n = 469; in columns). Row-scaled GSVA score values are represented. TF regulons are clustered and colored according to TF regulon groups identified in [187]Fig. 2A (left-side colored bar). Patients with B-ALL are clustered according to B-ALL subtype (top colored bar). The sample age (adult versus pediatric) is shown for each sample (top colored bar). (C) ORA of B-ALL subtype signatures over the B cell subpopulation signatures. The dot size represents the −log[10](adjusted P value), and the color range represents the gene ratio. (D) ORA for RUNX1-ETV6 high-affinity binding sites (n = 711) over the consensus OCRs for each B cell subpopulation. The dot size represents the −log[10](adjusted P value), and the color represents the odds ratio. (E) ORA of B-ALL subtype signatures (rows) over the TF regulon–activated gene set (columns) defined by the bulk approach. The dot size represents the −log[10](adjusted P value), and the color range represents the gene ratio. Only TFs with at least one significant association are shown. TF regulons clustered and colored according to TF regulon groups identified in [188]Fig. 2A (top-side colored bar). (F to I) Box plot showing the distribution of the gene set activity score (GSVA) computed on the TF regulon–activated gene set for TCF3, MEF2D, FOXD3, SPI1, OTX1, IRX2, STAT5B, and FOSB TF regulons by B-ALL subtype. This analysis is shown for the Li dataset and two additional datasets for validation [Rainer and Iacobucci (Iaco.)], when possible. To further investigate those initial observations, we first aimed to compare what defines transcriptional B-ALL subtypes compared to those of normal BCPs. To this end, we analyzed the enrichment of the transcriptional signatures for each of the B-ALL subtypes (fig. S10, D and E, and table S12) and transcriptional signatures of normal BCPs (fig. S4) using an overrepresentation analysis (ORA). Our analysis confirmed a statistically significant association of MEF2D fusions and TCF3-PBX1 ALL with pre–B cells and immature B cells, ETV6-RUNX1/like enriched in pro–B cells, and pointed to a significant association of BCR-ABL1 (Abelson tyrosine-protein kinase 1) with pro–B cells, ZNF348 (zinc finger and BTB domain containing 33) fusions with HSPCs, and the KMT2A (lysine methyltransferase 2A) fusion subtype with HSPCs/CLPs ([189]Fig. 5C). Moreover, when considering the chromatin accessibility data, we explored the enrichment of a set of 711 high-affinity binding sites for the ETV6-RUNX1 fusion defined by Wray et al. ([190]41) in our chromatin accessibility data, which confirmed again the significant association of the ETV6-RUNX1 ALL subtype with the pro–B cell state ([191]Fig. 5D). Once we observed significant associations between B-ALL subtypes and normal BCPs, we focused on identifying significant associations at the TF regulon level. To assess this, we calculated the enrichment of the transcriptional signatures of B-ALL subtypes in the activated target genes of regulons using the ORA. Our analysis uncovered specific TF regulons enriched in some of the B-ALL subtypes (see all regulons in [192]Fig. 5E). Some of the identified regulons were validated using two independent B-ALL public datasets: one including childhood ALL with different B-ALL subtypes (Rainer dataset, [193]GSE73578) ([194]42) and another including adult B-ALL with or without BCR-ABL1 mutation (Iacobucci dataset, [195]GSE13204) (fig. S10, F to I) ([196]43). Among those confirmed, we observed the expected elevated activity of the TCF3 regulon in the TCF3-PBX1 subtype and the MEF2D regulon in the MEF2D fusion subtype, which was also increased in the TCF3-PBX1 subtype ([197]Fig. 5F). In addition, we validated an increased activity of FOXD3 and SPI1 regulons in both subtypes, among others ([198]Fig. 5G). Although not significant in the ORA ([199]Fig. 5E), we identified an increased activity of OTX1 and IRX2 regulons—both associated with the pro–B cell state; see [200]Fig. 2—in the patients with ETV6/RUNX1 ALL ([201]Fig. 5H). Last, in the BCR-ABL1/ph–like ALL subtype, we detected a significantly increased activity score for STAT5B and FOSB—characteristics of early progenitor cells (cluster in cyan)—regulons with a particularly strong increase in the ZNF384 fusion subtype ([202]Fig. 5I). In summary, by translating our comprehensive resource to the B-ALL context, we illustrated their potential to obtain a deeper understanding of the molecular underpinnings of B cell malignancies. DISCUSSION We addressed the challenge of profiling gene expression and chromatin accessibility during the early stages of human B cell lymphopoiesis to elucidate the underlying GRN that orchestrates the differentiation process. To this end, to our knowledge, we compiled the most comprehensive sample collection to date, encompassing bulk-derived multiomics profiles of eight BCPs from 13 healthy donors (EGAS00001007296), publicly available for browsing at [203]https://translationalbio.shinyapps.io/brex/. This dataset, characterized by unprecedented granularity, provided the statistical power necessary for a detailed characterization of transcriptional and chromatin accessibility dynamics, allowing us to move beyond the transcription-based understanding of the gene regulatory landscape. The definition of the BCP subpopulations has been well established in mice—the Hardy fractions—([204]44); however, its translation to human surface markers remains controversial, and an unequivocal assignment is missing ([205]45). Here, we defined a sorting strategy that allowed the recovery of the human cell fractions equivalent to HSPCs, CLPs, pro–B cells (including FrB/C), pre–B cells (including FrC′/D), immature B cells (FrE), translational B cells (FrD), and naïve B cells (FrD) ([206]Fig. 1, A and H), providing an unparalleled resource to robustly understand the system and complete what has been previously achieved in mice or mature stages of human B cells ([207]3, [208]19, [209]46, [210]47). Although the defined strategy was accurate, we would like to note that, because of sorting limitations, the CLP fraction likely also contains early lymphoid progenitor cells (CD34^+CD38^+CD45RA^+CD133^+). Both gene expression and chromatin accessibility profiles delineate early B cell differentiation and recovered well-known biomarkers, biological processes, and key TFs ([211]Fig. 1). Chromatin accessibility and, more specifically, intronic and distal intergenic regions yield sharper distinctions between cell subpopulations than promoter regions or expression profiles (figs. S2A and S12, A and B), consistent with previous observations in the mouse immune system and human hematopoiesis ([212]19, [213]48). This increased variability supports the crucial role of enhancer elements in the B cell development regulation (fig. S12C). Given that gene expression is regulated by chromatin accessibility, in combination with other epigenetic factors ([214]49), we designed a custom joint analysis—drawing inspiration from a previous study using mouse bulk data ([215]19)—to reconstruct the GRN underlying early B cell differentiation (fig. S2). Although the computational approach involves several assumptions and may yield some false positive edges, the resulting GRN serves as a robust and valuable resource for hypothesis generation. This is supported by the following: (i) We used significant “OCR-gene” pairs derived from a large number of paired samples; (ii) we used footprinting, enabled by deep sequencing of ATAC-seq (more than 100 million reads per sample), to improve the estimation of active TF binding sites; (iii) the GRN recovers the already well-established regulators; and (iv) a specific GRN for each cell state can be derived, permitting us to investigate further the cell type specificity of the regulons ([216]Fig. 2). Moreover, we leveraged public datasets to confirm that the defined CREs are enriched by active histone marks ([217]50) and located within A-type compartments of chromatin conformation (fig. S13, A and B) ([218]51). Also, using chromatin conformation data from naïve B cells ([219]52), we validated several CREs associated with characteristic genes from the late stages of differentiation, including MS4A1 (CD20), CD79A/B, PAX5, and IKZF1 (fig. S13C). However, Hi-C or promoter Hi-C experiments covering the early B cell differentiation—currently unavailable—will be essential for refining the defined GRN and allowing further interrogation of the cis-regulatory architecture, possibly leveraging deep learning models. Also, chromatin immunoprecipitation sequencing (ChIP-seq) experiments can help to validate the regulons identified. The recovered GRN not only captured well-known regulators extensively studied in mice but also validated recently described genes contributing to human lymphoid lineage differentiation, such as MLXIP ([220]53). In addition, it provided a valuable classification of highly cell type–specific regulons and enabled the identification of master regulators, shedding light on understanding B cell differentiation and the transition to malignancy. Our GRN identified ELK3 as a putative candidate regulatory element of the early stages of B cell differentiation ([221]Fig. 2, A, C, and E), potentially involved in cell cycle progression mainly in pro–B cells ([222]Fig. 3). This TF has recently been defined as a regulon in large and small B cells (FrC′/D—pre–B cells) in the bursa of Fabricius of young broilers ([223]36), and it was associated with pro–B cell expansion in mice ([224]35). Furthermore, evidence supporting a role for ELK3 in human B cell differentiation has been reported recently ([225]37). Moreover, ELK3 has been described as a novel oncogene reported to be increased in several carcinogenic processes, including the ETV6-RUNX1–positive ALL, a kind of ALL that resembles the pro–B cell state and shows the down-regulation of the G[2]-M checkpoint pathway ([226]22, [227]54–[228]56). ELK3 inhibition has been revealed as a potential candidate therapy for tumor growth and metastasis in mice and prostate cancer cells ([229]57–[230]59). Further functional studies are needed to better understand the regulatory role of ELK3 in both healthy and disease contexts. To achieve a higher cellular resolution, which is not feasible with sorting strategies, we projected the defined bulk-derived regulons onto single-cell data. Analysis of two independent datasets further validated and refined the cell type specificity of TF regulons, which included the regulatory control of ELK3. Notably, MYBL2, FOXM1, and E2F7 regulons were identified as highly active in cycling cells, consistent with their known roles in regulating cell proliferation ([231]47, [232]60, [233]61). When we inferred regulons directly from the multiomics single-cell dataset (fig. S8, A to C, J, and K), we observed only 62 regulons (table S11), of which 34 were considered significant (21 activators, ρ > 0.7; 13 repressors, ρ < −0.70). Most of the activator regulons identified (17 of 21) matched those from our bulk approach, including HLF, MECOM, MEIS1, NFE2, PBX1, RUNX1/2, CEBPA, THRB, LYL1, FOS, FOSB, MYBL2, LEF1, TCF4, EBF1, and PAX5. However, despite the relevant overlap between both approaches, it became clear that single-cell data did not recapitulate the broader set of regulons identified through footprinting-based bulk analysis. This comparison underscores the need for integrative strategies that harness the depth of bulk data and the resolution of single-cell data to comprehensively define GRNs. Last, the translational relevance of our resource was demonstrated by leveraging the comprehensively defined regulatory landscape of early human B cell lymphopoiesis to study B-ALL malignancy. This revealed substantial heterogeneity in regulatory programs across B-ALL subtypes, with distinct associations between specific early B cell states and certain genetic subtypes, including MEF2D fusions, TCF3-PBX1, and ETV6-RUNX1/like. These findings support the concept of heterogeneous origins of B-ALL transformation and reflect the cell of origin or cellular state of malignant transformation, as recently described by Iacobucci et al. ([234]23) ZNF384 fusions were associated with HSPCs, potentially explaining the preference for allogeneic hematopoietic stem cell transplantation in this subtype, as well as the enrichment of HSC-like features in TCF3-ZNF384–positive patients ([235]62, [236]63). Similarly, KMT2A fusions showed a strong association with HSPCs and HSPC-associated regulons ([237]Fig. 5, C and E), supporting previously reported increases in lineage plasticity ([238]39). In addition, specific regulatory programs were identified and validated for various B-ALL subtypes, including the well-known increased activity of MEF2D and TCF3 in MEF2D fusions and TCF3-PBX1 subtypes, respectively ([239]64). We also identified candidate regulatory targets in other B-ALL subtypes, such as OTX1 and IRX2 in ETV6-RUNX1/like B-ALL. IRX2 has also been described as aberrantly expressed in other B-ALL subtypes, such as BCR/ABL1 and hyperdiploid karyotypes ([240]65). Moreover, OTX homeobox genes have been previously associated with B cell differentiation in Hodgkin lymphoma ([241]66). Beyond these biological insights, our work introduces a potential strategy for patient stratification based on active regulatory programs, either as an alternative to or in combination with genetic classification. Patient-specific regulon activity profiles could offer critical insights into treatment response and enable more personalized therapeutic approaches, supporting the recently uncovered correlation between developmental states and asparaginase response in B-cell ALL ([242]67). To briefly explore this concept, we performed a preliminary exploration of regulon activity in relation to treatment response using two public datasets: one evaluating CART19 (CD19-targeted chimeric antigen receptor T cell) therapy ([243]68) and another assessing the response to SMAC (second mitochondria-derived activator of caspase) mimetics ([244]69). Preliminary results suggest that nonresponders exhibit higher activity of stemness and proliferative-related regulons compared to responders (fig. S14). Last, it is important to acknowledge that, despite the high resolution achieved and the promising translatability for understanding underlying mechanisms in malignant contexts, much of the initial analysis relied on predefined early B cell subtypes. While single-cell mapping improved the dynamic resolution of regulons, their definition and associated GRNs remained largely confined to established cell populations. Consequently, part of the study is still limited to known cell types, and the existence of additional, yet uncharacterized, subpopulations cannot be ruled out. In this context, future CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) experiments may provide deeper insights by enabling more refined discrimination and isolation of cell types ([245]70). In summary, the generation of a comprehensive robust atlas of regulatory elements governing early human B cell differentiation provides a powerful resource to deeply interrogate the GRNs and epigenetic components that distinguish healthy and disease cell states. MATERIALS AND METHODS All data and codes used in the project can be found on GitHub ([246]https://github.com/TranslationalBioinformaticsUnit/bcell) and Open Science Framework (OSF; [247]https://osf.io/gswpy/). Human sample collection Human bone marrow and peripheral blood samples from the same individuals were collected for 13 healthy donors [7 males and 6 females; median age, 21 (20 to 23)] (table S1). Samples and data from donors included in the study were provided by the Biobank of the University of Navarra after informed consent was obtained and were processed following standard operating procedures. The Clinical Research Ethics Committee of the Clinica Universidad de Navarra approved the study (ethical approval 2021.035). FACS sorting Mononuclear cells were purified from whole blood and bone marrow samples using Ficoll-Paque PLUS (GE Healthcare, IL). After that, eight immune cell populations, representing the lymphoid hematopoietic differentiation cascade along the B cell lineage from adult human bone marrow precursors and peripheral blood, were obtained from each sample. These cells were purified following the sorting strategy shown in table S1 and fig. S1A in a BD FACSAria II (BD Biosciences) and then analyzed by bulk RNA-seq and bulk ATAC-seq. RNA-seq data generation Bulk RNA-seq was performed following the MARS-seq (massively parallel RNA single-cell sequencing framework) protocol adapted for bulk RNA-seq ([248]71, [249]72) with minor modifications. Briefly, cells were sorted in 100 μl of Lysis/Binding Buffer (Ambion), vortexed, and stored at −80°C until further processing. Poly-A RNA was selected with Dynabeads Oligo (dT) (Ambion) and reverse transcribed with AffinityScript Multiple Temperature Reverse Transcriptase (Agilent) using poly-dT oligos carrying a 7–nucleotide (nt) index. Up to eight samples with similar overall RNA content were pooled and subjected to linear amplification by in vitro transcription using the HiScribe T7 High Yield RNA Synthesis Kit (New England Biolabs). Next, RNA was fragmented into 250- to 350-nt fragments with RNA Fragmentation Reagents (Ambion) and dephosphorylated with FastAP (Thermo Fisher Scientific) following the manufacturer’s instructions. Next, partial Illumina adaptor sequences were ligated with T4 RNA Ligase 1 (New England Biolabs), followed by a second reverse transcription. Full Illumina adaptor sequences were added during final library amplification with KAPA HiFi DNA Polymerase (Kapa Biosystems). RNA-seq library quality controls consisted of quantification with a Qubit 3.0 Fluorometer (Life Technologies) and size profile examination using Agilent’s 4200 TapeStation System. Libraries were sequenced in an Illumina NextSeq 500 in a single end at a sequence depth of 10 million reads per sample. RNA-seq quantification and normalization Read quality was assessed with FastQC software (Babraham Institute). Low-quality, short-read, and TruSeq3-SE adapters were trimmed using Trimmomatic ([250]73) under the following parameters: ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30. Reads were then mapped to the December 2013 Homo sapiens high-coverage assembly GRCh38 from the Genome Reference Consortium GRCh38 (hg38) using STAR (version 2.6.0a) ([251]74) and counted for each gene using htseq-count ([252]75) with the –m intersection-nonempty option and a GTF file from GRCh38 Ensembl annotation release 96, getting the information for a total of 58,884 genes. Genes with a minimum read count of 1 in all population replicates and an associated ensemble ID were retained (23,454 genes). A sample identified as an outlier has been removed (sample ID: D213-4). Counts were normalized under trimmed mean of M-values (TMM) normalization and voom transformation from the edgeR r-package (version 3.24.3) ([253]76). Correlation between samples was computed using Pearson correlation (fig. S10A). RNA-seq data analysis PCA was used for data exploration ([254]Fig. 1B). Differential expression analysis was performed using limma (version 3.38.3) and edgeR (version 3.24.3) under two different approaches: (i) First is identifying biomarker genes for each cell subpopulation (all possible pair combinations). Genes with an adjusted P value ≤0.01 (Benjamini-Hochberg) and fold change ≥1.5 at least in four contrasts were defined as significant (B cell subpopulation biomarkers; fig. S4). (ii) Second is identifying DEGs between transition states (HSC versus CLP, CLP versus pro–B cell, pro–B cell versus pre–B cell, etc.). Genes with an adjusted P value ≤0.05 (Benjamini-Hochberg) and absolute fold change ≥1.5 were considered significant. A clustering analysis was performed to identify the different expression patterns within the DEGs over transitions (second approach) using a model-base clustering based on parameterized finite Gaussian mixture models [mclust r-package (version 5.4.5) under the following function and parameter: Mclust (modelName = “EII”)] (fig. S6A) ([255]77). Gene set enrichment analysis of Gene Ontology (GO) terms was performed using the ClusterProfiler r-package (version 4.6.0) ([256]78). The fold change of DEGs at different transitions was given as input to the compareCluster function, and the top 10 enriched terms were plotted using an in-house treeplot function ([257]Fig. 1F). ATAC-seq data generation Accessible chromatin mapping was performed using FAST-ATAC-seq ([258]48) with minor modifications. Briefly, 10,000 cells were sorted in 1× phosphate-buffered saline (PBS) and 0.05% bovine serum albumin (BSA) and pelleted by centrifugation at 500 rcf for 5 min at 4°C with low acceleration and brake settings in a precooled swinging-bucket rotor centrifuge. All the supernatant but 5 μl was removed. Next, 25 μl of transposase mix [15 μl of 2× TD buffer (Illumina), 1 μl of TDE1 (Illumina), 0.25 μl of 5% digitonin (Promega), and 8.75 μl of nuclease-free water] was added to the cells, and the pellet was disrupted by pipetting. Transposition reactions were incubated at 37°C for 30 min in an Eppendorf ThermoMixer with shaking at 450 rpm. Reactions were stopped at 4°C for 5 min. To release tagmented DNA, samples were incubated at 40°C for 30 min with 5 μl of cleanup buffer [900 mM NaCl (Sigma-Aldrich), 30 mM EDTA (Millipore), 2 μl of 5% SDS (Millipore), and 2 μl of proteinase K (NEB)]. DNA was purified using a 2× SPRI beads cleanup kit (Agencourt AMPure XP, Beckman Coulter). Two sequential polymerase chain reactions (PCRs) were performed using KAPA HiFi DNA Polymerase and customized Nextera PCR indexing primers (IDT) ([259]79). The conditions of the first PCR were as follows: 5 min at 72°C and 2 min at 98°C followed by nine cycles of 20 s at 98°C, 30 s at 63°C, and 1 min at 72°C. Depending on the library concentration in this first PCR, a second PCR (2 min at 98°C followed by four to six cycles of 20 s at 98°C, 30 s at 63°C, and 1 min at 72°C) was performed, aiming for a library concentration in the range of 2 to 10 ng/μl. PCR products were purified using a 2× SPRI beads cleanup kit. Libraries were quantified, and their size profiles were examined as described above. Sequencing was carried out in two batches: the first one in an Illumina NextSeq 500 using paired-end, dual-index sequencing (Read1: 38 cycles; Read2: 38 cycles; i7: 8 cycles; i5: 8 cycles) at a depth of 10 million reads per sample and second one in a HiSeq2500, v4 reagents using paired-end, dual-index sequencing (Read1: 125 cycles; Read2: 125 cycles; i7: 8 cycles; i5: 8 cycles) at a depth of 100 million reads per sample. ATAC-seq quantification and normalization Read quality was assessed with FastQC software (Babraham Institute). After trimming NexteraPE-PE adaptor sequences and low-quality reads and cutting the reads from the second batch to 38 base pairs (bp) using Trimmomatic ([260]73) under the following parameters: CROP:38 ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15. Reads were then mapped to the hg38 reference genome (GRCh38 Ensembl annotation release 96) using BWA-MEM ([261]80). Unmapped, unknown, random, ChrM mapping, duplicate reads and reads matching the black list [the blacklist includes problematic regions of the genome ([262]81)] were filtered out using samtools view -b -F 4 -F 1024 (samtools version 0.1.19) and Picard Tools (Picard MarkDuplicates; [263]http://broadinstitute.github.io/picard). BAM files from resequenced samples were merged using samtools merge. Paired-end reads were used to determine those regions with significant read coverage in contrast to the random background (regions known as peaks or OCRs) using MACS2 (function callpeak) with the following parameters: -g hs -f BAMPE -q 0.01 --nomodel --shift -100 --extsize 200 -B –SPMR ([264]https://github.com/taoliu/MACS) ([265]82). A minimum OCR size of 20 bp was set as a threshold. To get the consensus OCRs for each cell subpopulation, we found the genome regions that are covered by at least n − 1 of the sets of peaks (n being the total number of samples for a specific cell subpopulation), and then we merged the nearby peaks with a minimum gap width of 31. To get the consensus OCRs in the whole population, we get the regions in the genome where the coverage is at least in 1 of the consensus OCRs from the different cell subpopulations and then merged the nearby peaks with a minimum gap width of 31, resulting in 105,530 ATAC-seq OCRs. These processes were done using the rtracklayer (version 1.58.0) and GenomicRanges (version 1.50.2) r-packages ([266]83, [267]84). BigWig files were generated for each BAM file using the bamCoverage function (--binSize 20 --normalizeUsing RPKM --effectiveGenomeSize 2913022398) from deepTools (version 3.3.1) ([268]85). Then, a merged BigWig for each cell subpopulation was obtained through bigWigMerge from UCSC ([269]http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/). Also, BigBed files were generated from consensus BED files of each cell subpopulation using bedToBigBed from UCSC. These data were deposited in a public custom track from UCSC ([270]https://genome.ucsc.edu/s/TransBioLab/hg38_Brex). Using the whole consensus OCRs, we compute the count matrix using the multicov function from bedtools (version 2.25.0) ([271]86). Those OCRs without a minimum of 10 counts for at least one sample and 15 total counts were excluded (155 OCRs), retaining 105,359. Then, we normalize the data using the TMM method from the edgeR r-package (version 3.30.3) ([272]76). One of the samples analyzed was identified as an outlier and excluded (sample ID: D215-5). Correlation between samples was computed using Pearson correlation (fig. S10B). ATAC-seq data analysis PCA was used for data exploration ([273]Fig. 1C). OCRs were mapped to genomic locations (promoter, intronic, distal intergenic, etc.) using the ChIPseeker r-package (version 1.18.0) ([274]87) under the following arguments: region range of transcription start site (TSS) as +/− 1000 bp, gene level, “TxDb.Hsapiens.UCSC.hg38.knownGene” TxDb object (version 3.4.0) and “org.Hs.eg.db” annotation package (version 3.7.0) (fig. S2A). The Gini index was computed for each OCR to measure “chromatin inequality” (fig. S10B). A Gini index of 0 represents perfect equality and similar chromatin accessibility in all cell subpopulations, while a Gini index of 1 represents perfect inequality, all the accessibility OCRs come from a specific cell subpopulation, and the others show no accessibility. To compute the Gini index, the maximum chromatin accessibility observed from the normalized count matrix was assigned for each cell subpopulation. Then, the gini function from the REAT r-package (version 3.0.2) was applied ([275]88). Differential accessibility analysis was performed to identify differential accessibility regions between transition states (HSC versus CLP, CLP versus pro–B cell, pro–B cell versus pre–B cell, etc.) using a quasilikelihood negative binomial generalized log-linear model (glmQLFit) from edgeR (version 3.24.3) ([276]76). OCRs with an adjusted P value ≤0.01 (Benjamini-Hochberg) and absolute fold change ≥2 were defined as significant. A clustering analysis was performed to identify the different accessibility patterns within the differentially accessible regions (DARs) under a model-base clustering based on parameterized finite Gaussian mixture models [mclust r-package (version 5.4.5) under the following function and parameter: Mclust(modelName = “EII”)] (fig. S6B) ([277]77). To explore the accessibility of TFs, we computed the aggregated motif score (TFBS accessibility score) for each TF using the chromVAR r-package (version 1.5.0) ([278]89) with the default parameters. For plotting, the TFBS score was used ([279]Fig. 1E and fig. S3). Gene set variation analysis (GSVA) We performed the GSVA to compute the enrichment of specific gene sets within each individual sample. We applied GSVA to explore the transcriptional proximity of different B cell states between humans and mice ([280]GSE109125) ([281]Fig. 1H) ([282]19). To this end, we used the defined biomarker gene sets for each B cell subpopulation in humans (fig. S4) and computed the GSVA enrichment scores using the algorithm of Hänzelmann et al. ([283]90) by calling the gsva() function from the GSVA R package (version 1.50.5). We also used GSVA to assess the gene set enrichment of TF regulons in each B-ALL sample. We computed the GSVA of the TF regulons within the datasets of Li, Rainer, Iacobucci, Singh, and Zinngrebe ([284]Fig. 5, B and F to I, and fig. S12). To explore significant differences at the TF regulon level regarding treatment response, we used the GSVA results from the Singh and Zinngrebe datasets as input and performed differential analysis using limma (version 3.58.1) and sva (version 3.50.0) to adjust for sample heterogeneity using surrogate variable analysis ([285]91). TF regulons with an adjusted P value ≤0.05 (Benjamini-Hochberg) were defined as significant (fig. S12). OCR variance component analysis We used a linear mixed model approach for assessing the proportion of variance in gene expression explained by promoter and distal enhancer OCRs’ activity (fig. S10C). For each gene, we fit the following linear mixed model regression [MATH: Y=U_promoter+< mi mathvariant="normal">U_enhancers+ Epsilon :MATH] where Y is a vector of zero-centered gene expression values, U_promoter and U_enhancer are random effects, and Epsilon explains noise unaccounted by the model. The limix python library was used for fitting the regression models ([286]https://github.com/limix/limix) ([287]92). OCRs association with target genes (CREs) To determine the CREs of a gene (OCR-gene pairs), we follow two approaches: (i) For promoter OCRs, we directly annotate OCRs to genes based on TSS distance (<1 kb), and (ii) for intronic or distal intergenic OCRs, we explored the association between chromatin accessibility and gene expression. First, we identified all intronic or distal intergenic OCRs for each expressed gene within a window of ±1 Mb from the gene’s TSS. Second, the Spearman correlation was computed across all samples (59 samples; eight cell subpopulations) for each OCR-gene pair (fig. S2B). Bonferroni correction was used to adjust for multiple testing. OCR-gene pairs with an adjusted P value <0.05 were considered significant. A complete list of pairs (62,559) is available in OSF ([288]https://osf.io/gswpy/). OCRs association with TFs (trans-regulatory elements) To annotate OCRs to putative TF binding motifs (TF-OCR pairs), we first identify the footprints within our OCRs. For each sample, we recognize the footprints using two algorithms, HINT ([289]93) and Wellington ([290]94), using as input the corresponding sample BAM file and BED file from the corresponding cell subpopulation. For the footprints identified by HINT, we filtered out those footprints with less than 20 counts. Then, we get the union of the footprints identified for each sample through both strategies. Next, we compute the consensus footprint for each cell subpopulation following a similar strategy as for the OCR consensus. To get the consensus footprint for each cell subpopulation, we found the genome regions covered by at least n − 1 of the sets of regions (n being the total number of samples for a specific cell subpopulation). Last, to get the consensus footprint in the whole population, we get the regions in the genome where the coverage is at least in 1 of the consensus footprint from the different cell subpopulations, and then those with a width less than 6 bp were removed, resulting in a total of 380,922 ATAC-seq footprints. These processes were done using the rtracklayer (version 1.58.0) and GenomicRanges (version 1.50.2) r-packages ([291]83, [292]84). The identified footprints were located in 72,771 OCRs (69%). After identifying the footprints, we want to annotate the putative TF binding motif to each footprint. To do that, we used the Vierstra database ([293]95). First, we generate a reduced version of the Vierstra database, getting those annotations falling within our OCR regions using TABIX ([294]96). Then, we annotated at the archetype level (version 1.0) to reduce motif redundancy and overlap, resulting in 30,258,213 region-motif annotations and 286 motif archetypes. The overlap between footprints and motif archetypes was explored, defining a minimum overlap of 90% of archetype width. Last, footprints were linked to the original OCR to get the OCR-archetype matrix, resulting in a matrix 105,530 by 282. To translate archetypes to TFs, we define the following strategy. First, we compute the aggregated motif score (TFBS accessibility score) for each archetype using the chromVAR r-package (version 1.5.0) ([295]89) with the default parameters. Then, to identify the TFs most likely associated with an archetype, we compute, for each archetype, the Spearman correlation between the TFBS score of the archetype and the RNA expression of each TF clustered in that archetype. Correction for multiple comparisons was done using the Benjamini-Hochberg method. Moreover, the TFBS score variability and the coefficient of variation of TF gene expression were computed. We determine the TF-archetype association when (i) significant variation for both the TFBS score and TF gene expression and significant and positive correlation is observed (adjusted P value <0.05 and ρ > 0) and (ii) no considerable variation is observed for the TFBS score (<1Q TFBS variability) or TF gene expression (<1Q RNA coefficient of variation), independently of the correlation result, assuming that correlation will not be meaningful because of the low variability of the variables assessed. Last, we obtained a list of archetype-TF associations that allowed us to translate archetypes to TFs and get the OCR-TF matrix (105,530 by 275) with 60,881 OCRs linked at least to one TF motif. Generating the GRNs and regulons of B cell differentiation On the basis of the different associations of OCR-gene pairs (cis-regulation; 62,559 links) and TF-OCR pairs (trans-regulation; 2,299,328 links), we defined the GRN for early B cell differentiation (fig. S2B). To do that, from the defined OCR-gene and TF-OCR pairs, we first keep those interactions derived from the DEGs (differential expression at least in one transition), their promoter regions, and the associated enhancer elements with differential accessibility (differential chromatin accessibility at least in one transition) under the assumption that those genes and OCRs changing across cell subpopulations are going to be the ones governing the B cell differentiation process. Then, having the putative TF-OCR-gene interactions (n = 1,151,024), the correlation between TF-OCR and TF-gene was computed using Spearman correlation. To reduce spurious and meaningless TF-OCR-gene interactions, we excluded from the GRN those where (i) the TF showed no differential expression (n = 469,073); (ii) TF-gene association was not significantly correlated (Spearman P value >0.05; n = 186,837); (and iii) meaningless interactions including (a) positive correlation for TF-gene and TF-OCR but negative correlation for TF and gene, (b) negative correlation for TF-gene and TF-OCR but positive correlation for TF and gene, (c) positive correlation for TF-gene and TF-OCR but negative correlation for OCR-gene, (d) negative correlation for TF-gene, TF-OCR, and OCR-gene, and (e) positive correlation for TF-gene and OCR-gene but negative for TF-OCR (n = 117,846). Last, 377,268 TF-OCR-gene interactions were retained to reconstruct the GRN of the B cell differentiation process, involving 169 TFs, 7310 genes, and 16,074 OCRs. By exploring the TF-gene associations, we defined specific gene sets for each TF, referred to as regulons (TF regulons). Specific GRNs for cell subpopulations (csGRN) were computed on the basis of the chromatin accessibility and gene expression for each cell subpopulation. For example, for HSCs, the GRN captures those TF-OCR-gene interactions where genes were highly expressed in that cell subpopulation according to gene expression patterns (fig. S6A; genes included in brown clusters), and a median of normalized chromatin accessibility ≥1 (defining accessible OCR) was observed. Regulon cell type specificity and functionality analysis To evaluate the cell type specificity of each predicted regulon, we calculated the RSSs per cell type and regulon. The RSS algorithm is based on the JSD, a measure of the similarity between two probability distributions ([296]25, [297]97). Specifically, we used the binary enrichment of regulon activity (based on target regions or target genes) together with the cell assignments to see how specific each predicted regulon is for each cell type. To compute the binarized regulon activity, we used the GSVA score instead of the AUCell used in single-cell data. Then, a threshold of 0.2 was defined for binarizing the data, resulting in a matrix where, for each sample and regulon, we know if it is enriched (1) or not (0). On the basis of these binarized data and the cell type assignment for each sample, we calculated the RSS using the calculate_rss function from the scFunctions r-package (version 0.0.0.9000) ([298]https://github.com/FloWuenne/scFunctions/). The output was a data frame with an RSS score for each regulon-cell type combination. The RSS was computed twice for each regulon according to TF-gene behavior, a positive correlation between TF and gene pointed to an activator role and a negative correlation to a repressor role. A heatmap/dot plot shows the TF expression of the regulon on a color scale and cell type specificity (RSS) of the regulon on a size scale ([299]Fig. 2, A and B, and fig. S7A). To explore the biological functions of the defined regulons, we performed an ORA of the target genes of each regulon using the enricher function of the clusterProfiler r-package (version 3.18.1) ([300]78). The gene ratio and −log[10] from the adjusted P value under the Benjamini-Hochberg method were computed using GO or Kyoto Encyclopedia of Genes and Genomes databases (table S4). Top significant and highly relevant biological pathways were selected for the summary circular plot ([301]Fig. 2C) generated using the chordDiagram function from the circlize r-package (version 0.4.15) ([302]98). For the comparison of the pathway enrichment analysis for a list of gene sets, such as the target genes of a regulon in each cell state ([303]Fig. 3, E and F, and fig. S7, C and D), we used the compareCluster function from the clusterProfiler r-package (version 4.10.0) ([304]78). Topological property and network distance estimation and visualization According to the structure of our GRNs, different network properties were computed using the igraph r-package (version 1.3.5) ([305]99). Node betweenness (the amount of influence a node has over the flow of information in a graph through the shortest path calculation between all pairs of nodes) and degree (the total number of edges adjacent to a node) were evaluated for the complete set of GRNs with weighted edges according to Spearman’s coefficient of correlation between TF and gene ([306]Fig. 2E). Moreover, Pearson correlation was computed to compare the networks between them, and multidimensional scaling was used for visualization ([307]Fig. 2D). Single-cell samples (in-house dataset) A total of five human bone marrow samples was collected in our institution for the single-cell analysis after informed consent was obtained. Four samples were CD34+ human bone marrow cells, processed by a single-cell multiome, and one human bone marrow sample sorted for hematopoietic lineage (sorting strategy: CD34+ CD38− CD3− CD64− CD56- CD90+; fig. S1B) was divided to perform both single-cell RNA-seq (scRNA-seq) and single-cell ATAC-seq (scATAC-seq) analyses. Single-cell multiome data generation (in-house dataset) The epigenomic landscape and gene expression of 10,000 individual nuclei from the bone marrow were examined using Single Cell Multiome ATAC + Gene Expression (10x Genomics) according to the manufacturer’s instructions ([308]CG000365 RevC and [309]CG000338 RevF). Briefly, 100,000 CD34+ cells were lysed for 2 min, and then 16,100 nuclei were transposed in bulk. Next, nuclei were partitioned in gel bead-in-emulsions (GEMs) in the Chromium X instrument (10x Genomics), where each nucleus was captured separately and its transposed DNA and cDNA were uniquely indexed. Preamplified transposed DNA and cDNA were then used to generate separate ATAC-seq and RNA-seq dual-indexed libraries. Sequencing was performed in a NextSeq2000 (Illumina) at a depth of at least 20,000 reads per nucleus for gene expression (Read1: 28; Read2: 90; i7 index: 10; i5 index: 10) and 25,000 reads per nucleus for chromatin profiling (Read1: 50; Read2: 49; i7 index: 8; i5 index: 24). scRNA-seq data generation (in-house dataset) The transcriptome of the hematopoietic cells from bone marrow was examined using NEXTGEM Single Cell 3’ Reagent Kits v2 (10x Genomics) according to the manufacturer’s instructions. Between 17,600 and 20,000 cells were loaded at a concentration of 700 to 1000 cells/μl on a Chromium Controller instrument (10x Genomics) to generate single-cell GEMs. In this step, each cell was encapsulated with primers containing a fixed Illumina Read 1 sequence, a cell-identifying 16-nt 10x barcode, a 10-nt unique molecular identifier (UMI), and a poly-dT sequence. Upon cell lysis, reverse transcription yielded full-length, barcoded cDNA. This cDNA was then released from the GEMs, PCR amplified, and purified with magnetic beads (SPRIselect, Beckman Coulter). Enzymatic fragmentation and size selection were performed to optimize cDNA size before library construction. Fragmented cDNA was then end repaired, A-tailed, and ligated to Illumina adaptors. Final PCR amplification with barcoded primers allowed sample indexing. Library quality control and quantification were performed using a Qubit 3.0 Fluorometer (Life Technologies) and Agilent’s 4200 TapeStation System (Agilent), respectively. Sequencing was performed in a NextSeq500 (Illumina) (Read1: 26 cycles; Read2:55 cycles; i7 index: 8 cycles) at an average depth of 60,226 reads per cell. scRNA-seq data analysis (in-house dataset) The bcl (binary base call) sequence file format obtained on the 10x Genomics Chromium sequencing platform was transformed into fastq format using mkfastq from Cell Ranger 7.1.0 (for scRNA-seq) ([310]71) or Cellranger-arc 2.0.2 (for multiome single-cell sequencing). The count matrix used for downstream analysis was generated using the count pipeline from Cell Ranger 7.1.0 (for scRNA-seq) or Cellranger-arc 2.0.2 (for multiome single-cell sequencing) with the GRCh38 reference transcriptome provided by 10x Genomics. Following Seurat guidelines (version 4.3.0.1) ([311]100), several cell and feature quality filtering steps were applied to each sample. First, doublets were removed using scDblFinder (version 1.12.0) ([312]101), and then cells with >30% of mitochondrial percentage in single-cell multiome or >10% in scRNA-seq were filtered out. Also, cells with <5% of ribosomal percentage were filtered out in the scRNA-seq sample. Last, we filtered out extreme outlier cells or features on the basis of lower (0.05) to upper (0.95) bound quantiles for “number of features” and “counts per cell.” Normalization of the gene expression measurements for each cell was done by multiplying the total expression by a scale factor (10,000) and log-transforming it. Feature selection criteria have been used for downstream analysis based on most variable genes using 2000 top genes. Feature subspaces were obtained from most variable genes using PCA using 20 components. Clustering was computed using the Louvain algorithm over the principal components subspace with a resolution value of 0.5. For data visualization, uniform manifold approximation and projection (UMAP) was used, computing it over all PCA components obtained. Then, a first annotation based on several hematopoietic biomarkers for stem cells, lymphoid cells (T and B cells), and myeloid cells (erythroid, monocytic, and granulocytic cells) was done. Those cell clusters representing undifferentiated cells and lymphoid lineage within the analyzed CD34+ cells were kept for forward analysis. Next, to integrate the five single-cell transcriptional profiles, we merged the samples using Seurat. Merged samples were normalized on the basis of the gene expression measurements for each cell by multiplying the total expression by a scale factor (10,000) and log-transforming it. Regress-out of cell cycle was applied. Feature selection criteria were used for downstream analysis based on most variable genes using 2000 top genes. Feature subspaces were obtained from most variable genes using PCA using 20 components. To remove the influence of the dataset of origin, we applied Harmony ([313]102). Clustering was computed using the Louvain algorithm over the principal components subspace with a resolution value of 0.8 based on the clustree approach (version 0.5.0) ([314]103). To annotate cell identity, we checked for B cell biomarkers, cell cycle phase, and AUCell score (AUCell version 1.20.2) ([315]18) derived from our bulk signatures (fig. S8). We also computed the AUCell score to assess the gene set activity of TF regulons at a single-cell resolution and visualized the overall profile with a heatmap representation scaled by rows ([316]Fig. 4D). scATAC-seq data generation (in-house dataset) The scATAC experiments were performed with the Chromium Next GEM Single Cell ATAC kit version 1.1 from 10x Genomics following the manufacturer’s instructions. For samples with 100,000 cells or less, the nucleus isolation is done following the 10x Genomics–demonstrated protocol for low cell input. Briefly, cells are collected in PBS and 0.04% BSA at 4°C; after that, cells are counted in a cell counter; and centrifugation is done to concentrate them in 50 μl of PBS and 0.04% BSA. After that, 45 μl of supernatant is removed, and 45 μl of chilled and freshly prepared lysis buffer is added. After 5 min of incubation on ice, 50 μl of wash buffer is added. After that, centrifugation is done to remove 95 μl of the supernatant without disrupting the pellet. Additional centrifugation is done with 45 μl of diluted nucleus buffer to remove the supernatant. Isolated nuclei are resuspended in 7 μl of chilled diluted nucleus buffer. Two of these microliters are used to determine the cell concentration using a cell counter. The other 5 μl is used immediately to generate a bulk transposition reaction following the manufacturer’s instructions and conditions. In the transposase reaction, the transposase enters the nuclei and preferentially fragments the DNA in open regions of the chromatin. In this step, adapter sequences are added to the ends of the DNA fragments. After transposition, GEMs are generated in the Chromium controller (10x Genomics). Upon GEM generation, the gel bead is dissolved. Oligonucleotides containing an Illumina P5 sequence, a 16-nt 10x barcode, and a Read1 sequence are released and mixed with DNA fragments and master mix. After incubation, GEMs are broken, and pooled fractions are recovered. P7, a sample index, and the Read2 sequence are added during library construction via PCR. Library quality control and quantification were performed using a Qubit 3.0 Fluorometer (Life Technologies), Agilent’s 4200 TapeStation System, and Bioanalyzer 2100 (Agilent). Sequencing was performed using a NextSeq500 (Illumina) (Read1: 50 cycles; i7 index: 8 cycles; i5 index: 16 cycles; Read2: 50 cycles) to an average depth of 14,900 fragments per cell. scATAC-seq data analysis (in-house dataset) The bcl file format obtained on the 10x Genomics Chromium sequencing platform was, in this case, transformed into fastq format using mkfastq from Cell Ranger ATAC 1.1.0 (for scRNA-seq) ([317]104) or Cellranger-arc 2.0.2 (for multiome single-cell sequencing). The count matrix used for downstream analysis was generated similarly using the count pipeline from Cell Ranger ATAC 1.1.0 (for scATAC-seq) or Cellranger-arc 2.0.2 (for multiome single-cell sequencing) with the GRCh38 reference transcriptome provided by 10X Genomics. To make bulk ATAC-seq and scATAC-seq comparable, a requantified version of the scATAC-seq original count matrix was conducted to generate a new feature × cell raw count matrix with the same OCRs from bulk as the feature space using the FeatureMatrix function from the Signac package ([318]105). The new scATAC-seq matrix contains recalculated count values based on scATAC-seq genomic fragments according to the bulk ATAC-seq feature space. Then, we implemented quality control criteria, retaining cells with a FRIP (fraction of reads in peaks) value of at least 50% and a TSS enrichment score above 5. Cells with a log-transformed number of unique fragments per barcode below 3.8 were filtered out. Furthermore, doublet cells were identified and excluded using the Scrublet algorithm with default parameters ([319]106). Dimensionality reduction was conducted using latent dirichlet allocation by performing Gibbs sampling for various topic counts (K). The optimal K value was selected on the basis of the highest consensus scores from four pycisTopic proposed metrics ([320]107). Accessibility imputation was carried out using the impute_accessibility function from the pycisTopic package ([321]17), with a scaling factor of [MATH: 106 :MATH] to normalize accessibility scores across selected cells and regions. Subsequently, gene activity scores were computed using the get_gene_activity function from the pycisTopic package, incorporating region-cell probabilities, distance weights, and upstream/downstream search spaces ranging from 1 to 100 kb to refine gene activity assignments. Default values of distance weights and other parameters, such as gene boundaries and TSS extensions, were used to optimize the accuracy of the activity predictions. Label transfer was executed in two phases. First, annotations derived from scRNA-seq were transferred to the multiome scATAC-seq (activity scores) samples using annotation transfer methods including Harmony, BBKNN, Scanorama, and canonical correlation analysis. The accuracy of these transfers was evaluated against the annotations of the corresponding cell labels in the scRNA-seq, with Harmony yielding the highest accuracy across four multiome samples. In the second phase, this transfer method was used to annotate the multiome samples and the unpaired scATAC-seq sample. Last, samples were integrated using the Harmony method ([322]102), facilitating the generation of a common subspace for visualization. scATAC-seq regulon analysis (in-house dataset) To explore enhancer GRNs (eGRNs) from single-cell multiomic data, we integrated the previously identified scATAC-seq data with guidelines suggested by the SCENIC+ workflow ([323]17). In line with SCENIC+ recommendations, we identified candidate enhancer regions by binarizing region-topic probabilities and analyzing DARs. The binarization process used Otsu’s method, followed by dropout imputation by multiplying cell-topic and topic-region distributions, as previously established by cisTopic ([324]107). Normalization of the data was conducted using a scale factor of 10^4, after which highly variable peaks were identified. DARs were computed using the Wilcoxon rank-sum test, maintaining the same ordered labeling of cell types that was established in the prior scRNA-seq analysis. To generate eGRNs, we merged scRNA-seq and scATAC-seq data to create a pseudomultiome dataset. For each cell type annotation, we randomly selected cells from both datasets. Meta-cells were constructed by averaging scRNA-seq and scATAC-seq counts for each annotation with K = 5 cells. Motif enrichment analysis was then performed using the SCENIC+ pipeline, integrating binarized topics and DARs from pycisTopic. We applied the cisTarget and DEM (differential enrichment of motifs) algorithms for motif discovery. cisTarget identified genomic regions from the SCREEN database (HG38) using the Cluster-Buster tool ([325]108) and ranked motifs by motif scores. A minimum overlap of 40% between input regions and reference motifs was required. Recovery curve-based scoring and normalized enrichment scores (NESs) were computed for each motif, retaining only motifs with an NES >3. DEM performed a Wilcoxon test to assess motif enrichment between foreground and background regions. Motifs with adjusted P values <0.05 (Bonferroni corrected) and a log[2] fold change >0.5 were selected. Cistromes were generated by merging motif-based regions associated with TFs, providing a comprehensive map of TF-target regulatory interactions. Last, region-to-gene distances were computed, and TF-target importances were evaluated using the gradient boosting machine method, adhering to SCENIC+ guidelines. scRNA-seq data analysis (public dataset, phs002371.v7.2) We downloaded the FASTQ files for the phs002371.v7.p2 study from Human Tumor Atlas Network (HTAN; phs002371.v7.p2) ([326]39). To generate the count matrices needed in the downstream analysis, we used the count pipeline from Cell Ranger 7.1. with the GRCh38 reference transcriptome provided by 10x Genomics. Following Seurat guidelines (version 5.0.1) ([327]109), several cell and feature quality filtering steps were applied to each sample. First, doublets were removed using scDblFinder (version 1.12.0) ([328]101), and then following the authors’ preprocess, cells with >15% of mitochondrial percentage, cells with fewer than 1500 UMIs or greater than 50,000 UMIs, and cells with hemoglobin detection (>10 counts of hemoglobin-related genes) were filtered out. Healthy samples (n = 15 samples from five donors) were merged, and normalization of the gene expression measurements for each cell was done by multiplying the total expression by a scale factor (10,000) and log-transforming it. Feature selection criteria have been used for downstream analysis based on most variable genes using 1500 top genes. Regress-out of counts and mitochondrial percentage was applied. Feature subspaces were obtained from most variable genes using PCA using 20 components. Clustering was computed using the Louvain algorithm over the principal components subspace with a resolution value of 0.5. For data visualization, UMAP was used, computing it over all PCA components obtained. Then, a first annotation based on several hematopoietic biomarkers for stem cells, lymphoid cells (T and B cells), and myeloid cells (erythroid, monocytic, and granulocytic cells) was done. Those cell clusters representing undifferentiated cells and lymphoid lineage within the analyzed cells were kept for forward analysis. Next, to integrate the 15 single-cell transcriptional profiles of the subset of cells of interest, the merged samples were normalized on the basis of the gene expression measurements for each cell by multiplying the total expression by a scale factor (10,000) and log-transforming it. Regress-out of cell cycle was applied. Feature selection criteria were used for downstream analysis based on most variable genes using 2000 top genes. Feature subspaces were obtained from most variable genes using PCA using 20 components. To remove the influence of the dataset of origin, we applied Harmony ([329]102). Clustering was computed using the Louvain algorithm over the principal components subspace with a resolution value of 0.8 based on the clustree approach (version 0.5.0) ([330]103). To annotate cell identity, we checked for cell biomarkers, cell cycle, and AUCell score (AUCell version 1.20.2) ([331]18) derived from our bulk signatures (fig. S8). Public RNA-seq data from patients with BCP ALL (Li dataset) Transcriptomic data from patients with BCP ALL (leukemic cells from bone marrow aspirates) were obtained from the European Genome-phenome Archive. The RNA-seq data (BAM files) from EGAD00001002151, EGAD00001002692, and EGAD00001002704 datasets were downloaded after the corresponding access request. Data from adult and pediatric individuals belonging to a well-known molecular subgroup ([332]40), including MEF2D fusions, TCF3-PBX1, ETV6-RUNX1/like, DUX4 fusions, ZNF384 fusions, BCR-ABL1/Ph–like, hyperdiploidy, and KMT2A fusions were included in the analysis, resulting in 469 patients ([333]Fig. 5A). As BAM files were mapped to GRCh37, we first obtained the Fastq files using the bamtofastq function from bedtools (version 2.25.0) ([334]86). Next, the read quality was assessed with FastQC software (version 0.11.7) (Babraham Institute). Low-quality, short-read, and TruSeq3-SE adapters were trimmed using Trimmomatic (version 2.6.0a) ([335]73) under the following parameters: ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75. Reads were then mapped to the December 2013 H. sapiens high-coverage assembly GRCh38 from the Genome Reference Consortium GRCh38 (hg38) using STAR (version 2.6.0a) ([336]74) and counted for each gene using htseq-count ([337]75) with the –m intersection-nonempty option and a GTF file from GRCh38 Ensembl annotation release 96, getting the information for a total of 58,884 genes. Genes with a minimum read count of 5 in all samples were retained (32,416 genes). Next, batch correction was done using the ComBat_seq function from the sva r-package (version 3.38.0) ([338]91, [339]110). Counts were normalized under TMM normalization and voom transformation from the edgeR r-package (version 4.0.2) ([340]76). PCA was used for data exploration (fig. S10, A to C). Differential expression analysis of leukemic subtypes was performed using limma. Genes with an adjusted P value ≤0.01 (Benjamini-Hochberg) and fold change ≥1.5 in at least six contrasts were defined as significant to get the B-ALL subtypes biomarkers. The B-ALL subtype biomarkers list can be found in table S12. ORA in public BCP ALL data (Li dataset) The ORA of the gene sets from B-ALL subtype signatures (fig. S9, D and E, and table S12) was performed in the signatures of the different B cell populations (fig. S4) using the enricher function of the clusterProfiler r-package (version 3.18.1) ([341]Fig. 5C) ([342]78). The gene ratio and −log[10] from the adjusted P value under the Benjamini-Hochberg method were computed. The ORA was also applied to explore the enrichment of 711 binding sites defined as the highest-affinity binding sites for the ETV6-RUNX1 fusion under ChIP-seq analysis ([343]41) in the consensus OCRs for each cell subpopulation ([344]Fig. 5D). Last, we applied the ORA to explore whether there was a significant enrichment of the regulons identified along B cell differentiation in the different B-ALL subtype signatures derived from the Li dataset ([345]Fig. 5E). Public transcriptomic data from patients with BCP ALL Transcriptomic data from patients with BCP ALL (leukemic cells from bone marrow aspirates) were obtained from different public datasets to validate the results found under the Li dataset approach. Four different datasets were obtained from Gene Expression Omnibus (GEO): Rainer dataset ([346]GSE73578) ([347]42), microarray-based whole-genome expression profiling of lymphoblasts from 46 patients; Iacobucci dataset ([348]GSE13204) ([349]43), microarray gene expression profiling of 144 adult patients with B-ALL and who are B-ALL negative for known molecular rearrangements; Singh dataset ([350]GSE153670) ([351]68), RNA-seq data from bone marrow samples taken from patients enrolled on ENSIGN or ELIANA clinical trials before infusion of CART19 (n = 27); Zinngrebe dataset ([352]GSE140556) ([353]69), microarray data from eight SMAC mimetic monotherapy (SM)–sensitive and six SM–insensitive BCP ALL samples. Microarray processed data downloaded from GEO were used for the computation of GSVA score for each dataset. In the case of the Singh dataset, the downloaded raw counts from RNA-seq were filtered, defining a minimum read count of 5 in all samples per gene and normalized under TMM normalization and voom transformation from the edgeR r-package (version 4.0.2) ([354]76). PCA was used for data exploration (fig. S10, F to I). Public ChIP-seq histone marks ChIP-seq data from the ENCODE project ([355]www.encodeproject.org/) for H3K4me1, H3K4me3, and H3K27ac on B cells, naïve B cells, and common CD34+ myeloid progenitors were downloaded (table S1). To get the consensus regions for each cell subpopulation and histone mark, we found the genome regions that are covered by at least n − 1 of the sets of peaks (n being the total number of samples for a specific cell subpopulation), and then we merged the nearby peaks with a minimum gap width of 31. To get the consensus regions for a histone mark in the whole population, we get the regions in the genome where the coverage is at least in 1 of the consensus histone marks from the different cell subpopulations and then merged the nearby peaks with a minimum gap width of 31, resulting in 166,295 H3K4me1 regions, 37,952 H3K4me3 regions, and 159,079 H3K27ac regions. These processes were done using the rtracklayer (version 1.58.0) and GenomicRanges (version 1.50.2) r-packages ([356]83, [357]84). Then, the overlap between our OCRs and histone marks was assessed using the findOverlaps function (minoverlap = 6) of GenomicRanges (fig. S13A). Public Hi-C data Hi-C public data of naïve B cells were obtained from EGAD00001006486 ([358]51). The valid reads of Hi-C experiments were processed with TADbit version 1.1 ([359]111). Hi-C data were normalized using the oned function of the dryhic r-package (version 0.0.0.9100) ([360]112) at a 100-kb resolution to remove known experimental biases, a method that controls for the presence of abnormal karyotypes in cancer samples. Topologically associating domains were detected at a 40-kbp resolution using the TopDom algorithm ([361]113). The significant Hi-C interactions were called with the Fit-Hi-C tool (version 1.1.3) ([362]114), binned at a 10-kb resolution and with the default thresholds: P value <0.001 and q-value <0.001. To study the compartmentalization of the genome in naïve B cells, we used merged biological replicates, resulting in interaction Hi-C maps with around 300 million valid reads each. Briefly, normalized chromosome-wide interaction matrices at a 100-kb resolution were transformed into Pearson correlation matrices. These correlation matrices were then used for PCA, for which the first eigenvector (EV) normally delineates genome segregation. The multimodal distribution of the EV coefficients from the B cell dataset was modeled as a Gaussian mixture with three components (k = 3). To estimate the mixture distribution parameters, an expectation-maximization algorithm using the normalmixEM function from the mixtools r-package (version 1.2.0) was applied ([363]115). A Bayesian information criterion was computed for the specified mixture models of clusters (from 1 to 10) using the mclustBIC function from the mclust r-package (version 5.4.9) ([364]77). Three underlying structures were defined: alternative compartmentalization into A-type (with the most positive EV values), B-type (with the most negative EV values), and I-type (an intermediate-valued region with a different distribution) compartments. Two intersection values were defined at the intersection points between two compartments and used as standard thresholds to categorize the data into the three different compartments; that is, the A-type compartment was defined for EV values >=+0.1388, the I-type compartment was defined for EV values between +0.1388 and −0.01955, and the B-type compartment was defined for EV values between <=−0.01955. Then, we checked the overlap between our OCRs and the compartments identified using the findOverlaps function (default parameters) of the GenomicRanges r-package (version 1.50.2) (fig. S13B) ([365]84). When an OCR overlaps with two compartments, we assigned it to the compartment with a higher overlap. Public promoter capture Hi-C interaction data Detected interactions using the promoter capture Hi-C technology from Javierre et al. ([366]52) were downloaded. From the promoter capture Hi-C peak matrix, we select all those interactions with a CHiCAGO score ≥5 in the naive B cell population, resulting in 197,442 interactions. The coordinates of the interacting regions were translated from GRCh37 to GRCh38 using the liftOver r-package (version 1.18.0). Then, we check the overlap between our interactions (OCR-gene associations) and their interactions using the findOverlaps function (minoverlap = 6) of the GenomicRanges r-package (version 1.50.2) (fig. S13C showing some of the overlapping interactions) ([367]84). Web application The Shiny r-package (version 1.8.0) ([368]116) was used to generate a user-friendly web interface for browsing called B-rex ([369]https://translationalbio.shinyapps.io/brex/). For scRNA-seq, the ShinyCell template (version 2.1.0) was used ([370]117). The Shiny app is hosted on [371]https://www.shinyapps.io/. Acknowledgments