Abstract

   Dysregulation of early B cell lymphopoiesis—the process guiding
   cellular immunity development—can lead to malignancy, making it crucial
   to understand its regulatory mechanisms. We generated a multiomics
   resource comprising paired chromatin accessibility and gene expression
   profiles across eight human B cell precursor populations, providing a
   detailed characterization of early human B cell development.
   Integrative analysis revealed highly cell type–specific regulatory
   elements and enabled the reconstruction of the gene regulatory network
   governing differentiation. We identified putative candidate regulons,
   such as ELK3, enriched in pro–B cells and potentially involved in cell
   cycle progression. Regulons from bulk data were projected onto
   single-cell data, validating their activity and refining the regulatory
   landscape. This resource enabled identification of active regulatory
   programs and transformation-associated states in B cell acute
   lymphoblastic leukemia. The publicly available atlas provides a
   valuable resource for understanding B cell development and disease,
   supporting future efforts to decode regulatory programs in immunity and
   hematologic malignancies.
     __________________________________________________________________

   Regulatory landscape of human B cell lymphopoiesis provides insights
   into the pathogenesis of B cell acute lymphoblastic leukemia.

INTRODUCTION

   B cell lymphopoiesis gives rise to a diverse repertoire of peripheral B
   cells, antibody-producing cells capable of mediating protection against
   pathogens while remaining tolerant of self-antigens. Before migration
   to the periphery, B cell development occurs in the bone marrow
   following a complex process governed by the sequential rearrangement of
   the immunoglobulin (Ig) heavy and light chain genes and their
   expression and subsequent assembly of their encoded proteins into the B
   cell receptor (BCR) ([72]1, [73]2). This highly ordered and controlled
   process is divided into several cell stages or B cell subpopulations
   ([74]3, [75]4). The transcriptional profile of each cell subpopulation
   emerges from an underlying gene regulatory network (GRN) in which a
   limited number of transcription factors (TFs) and cofactors regulate
   each other and their downstream target genes ([76]5). Several TFs
   orchestrating the B cell development have been described ([77]2,
   [78]6). Key lineage TFs, such as IKZF1 (Ikaros family zinc finger 1)
   and PU.1, are implicated in the initial commitment toward the lymphoid
   lineage ([79]7, [80]8). PAX5 (paired box 5), EBF1 (early B cell factor
   1), and E2A [TCF3 (transcription factor 3)] are regulators of B cell
   lineage commitment ([81]9–[82]11), while IRF4 (interferon regulatory
   factor 4), IRF8, and FOXO1 (forkhead box O1) are generally associated
   with Ig rearrangement ([83]12, [84]13). Last, OCT2 [POU2F2 (POU class 2
   homeobox 2)] and OBF1 (OCT-binding factor 1) are crucial TFs that drive
   the migration of immature B cells to the periphery ([85]14, [86]15).
   Most of this regulatory knowledge is derived from studies in mice, and
   thus, it is unclear whether the mechanisms identified follow the same
   regulatory programs as those of humans ([87]16).

   The integrated analysis of paired chromatin accessibility and gene
   expression profiling across various cell types offers a powerful
   approach to reconstructing GRNs and identifying cell type–specific
   regulons (TFs and their associated set of targets) ([88]17–[89]19), a
   crucial approach to deciphering the regulation programs in humans. Such
   an analysis involves predicting cis-regulatory elements (CREs),
   identifying potential TFs, and linking the CREs to their candidate
   target genes. This strategy was followed to provide initial insights
   into the cis-regulation of mouse hematopoiesis ([90]19), capturing the
   key TFs committing the B cell lineage, and in the study of
   mixed-phenotype acute leukemia, revealing RUNX1 (runt-related
   transcription factor 1) as a putative regulator ([91]20).

   More recent advances in multiomics single-cell approaches have allowed
   GRN inference, applied to study human peripheral blood mononuclear
   cells ([92]17) and plasma cell differentiation in the human tonsil
   ([93]21). Furthermore, a detailed characterization of B cell
   subpopulations in mice and human lymphopoiesis has been obtained using
   single-cell technology approaches ([94]3, [95]4, [96]22, [97]23).
   Despite notable progress, the understanding of the regulatory
   mechanisms of this process remains incomplete, with current single-cell
   multiomics experiments providing insufficient depth (detailed
   profiling) and breadth (enough individuals) to conduct detailed
   epigenetic profiling of the early stages of human B cell
   differentiation.

   To address this gap and to decipher the GRN of early B cell
   differentiation in humans, we generated comprehensive chromatin
   accessibility and transcriptional profiles corresponding to the
   stepwise progression of B cell precursors (BCPs). Our study included
   eight cell subpopulations across 13 healthy donors. A detailed joint
   analysis of paired high-throughput data [bulk RNA sequencing (RNA-seq)
   and assay for transposase-accessible chromatin using sequencing
   (ATAC-seq)] allowed the generation of a comprehensive GRN of B cell
   differentiation, capturing cell type specificity and dynamism of
   regulons, also recovering the master regulators of this process. Our
   resource not only recapitulated well-known regulatory elements but also
   revealed candidate regulators of early B cell differentiation, such as
   ELK3, which appears to be associated with pro–B cell cycle progression.
   Next, we further explored the inferred regulatory programs at a
   single-cell resolution, which confirmed the observed regulatory
   dynamics and enabled a finer-grained definition of regulon-based early
   B cell differentiation. Last, we leveraged this detailed
   characterization of normal B cell development to deconvolve B cell
   malignancies, revealing transformation-associated cellular states and
   highlighting subtype-specific regulatory programs in certain forms of B
   cell acute lymphoblastic leukemia (B-ALL). This detailed and extended
   portrait of regulatory elements has been made publicly available in a
   user-friendly web interface for browsing (B-rex;
   [98]https://translationalbio.shinyapps.io/brex/), representing a rich
   playground to further unravel the regulatory mechanisms of B cell
   differentiation in health and disease.

RESULTS

Profiling of early human B cell differentiation stages by RNA-seq and
ATAC-seq

   To investigate early human B cell differentiation, we harvested bone
   marrow and peripheral blood from the same healthy individuals [n = 13;
   mean age, 21.4 (20 to 23)]. Using fluorescence-activated cell sorting
   (FACS), we isolated hematopoietic stem progenitor cells (HSPCs), common
   lymphoid progenitors (CLPs), pro–B cells, pre–B cells, and immature B
   cells from each bone marrow sample and transitional B cells and naïve B
   cells (CD5+ and CD5−) from peripheral blood specimens ([99]Fig. 1A and
   fig. S1A). All these populations were characterized using paired
   transcriptomic (RNA-seq) and chromatin accessibility (ATAC-seq) assays
   (fig. S2A and table S1).

Fig. 1. Characterization of early B cell differentiation through chromatin
accessibility and gene expression profiling.

   [100]Fig. 1.
   [101]Open in a new tab

   (A) Major BCP subpopulations from the human bone marrow and peripheral
   blood. The marker strategy followed to isolate eight B cell
   subpopulations from 13 healthy individuals is shown. Markers strongly
   expressed by the given B cell subpopulation are depicted in black,
   whereas a lack of expression is shown in gray. (B) PCA of gene
   expression (79 samples), (C) chromatin accessibility (78 samples), and
   (D) TFBS activity score (78 samples). Each dot represents a sample
   colored by cell subpopulation. (E) Heatmap representation of TFBS
   activity score for key B cell regulators. Samples are shown by columns
   (n = 78) and grouped by cell subpopulation, and TFs are shown by rows
   (n = 12). (F) Representation of top 10 enriched GO terms derived from
   the transcriptional changes over B cell transitions. The NES is shown
   in a red-blue color range for each term and the −log[10] of the
   adjusted P value as dot size. Biological functions are grouped into
   four main terms. (G) Heatmap representation of the gene expression
   profile for 58 well-known genes associated with B cell lymphopoiesis.
   Samples are shown by columns (n = 79; grouped by cell subpopulation),
   and genes are shown by rows (n = 58; grouped by biological processes).
   Normalized gene expression levels are represented. (H) Heatmap
   representation of the gene set activity score (GSVA) computed for each
   human B cell subpopulation signature (n = 8; in rows) in each mouse
   sample (n = 17; in columns). The GSVA score represented in the heatmap
   is scaled by column to highlight the human B cell subpopulation with
   higher enrichment in each mouse B cell subpopulation (in dark blue).

   Principal components analysis (PCA) of both chromatin accessibility and
   gene expression data revealed consistent and well-defined agreement of
   cell types across individuals, effectively recapitulating the dynamic
   process of B cell differentiation ([102]Fig. 1, B and C). To dissect
   the TF-mediated regulatory mechanisms underlying this process, we first
   inferred the TF activity by identifying TF binding at accessible or
   open chromatin regions (OCRs). On the basis of these TF binding sites,
   we estimated the TF activity for each sample (TFBS activity score; see
   Materials and Methods), enabling the identification of highly active
   regulators within distinct cell subpopulations ([103]Fig. 1, D and E).
   The resulting TF activity profile showed clear cell type–specific
   clustering and recapitulated known regulatory circuits associated with
   B cell differentiation ([104]6). Among the transcriptional programs
   identified, pro–B cells and pre–B cells displayed the most distinct
   programs, underscoring key developmental transitions such as Ig
   rearrangement mediated by FOXO1 and IRF4/8, with further B cell
   commitment driven by EBF1 and PAX5. In HSPCs, stemness-associated
   TFs—GATA2, TAL1, and MEIS1—were highly active, while CLPs showed an
   increased activity of lymphoid-specifying TFs such as TCF3 and Ikaros
   (IKZF1). At later stages of differentiation, TFs involved in B cell
   maturation, including POU2F2 and SPIB, became prominent ([105]Fig. 1E
   and fig. S3 for the TFBS profile of all TFs).

   To further explore the molecular processes captured by our dataset, we
   examined biological pathways enriched in transcriptional changes over B
   cell transitions ([106]Fig. 1F and table S2), as well as the
   transcriptional and chromatin accessibility profiles of well-known
   markers of B cell differentiation ([107]Fig. 1G and
   [108]https://genome.ucsc.edu/s/TransBioLab/hg38_Brex for the chromatin
   accessibility profile). We observed a significant enrichment in
   proliferation and differentiation pathways for CLPs that were also
   preserved in pro–B cells and pre–B cells ([109]Fig. 1F). These
   pathways, in the early stages of B cell differentiation, are mainly
   regulated by the interleukin-7 receptor signaling, highly expressed in
   CLPs, pro–B cells, and pre–B cells, to later switch to the antagonistic
   signaling by the pre–B cell receptor (pre-BCR), expressed in pre–B
   cells ([110]Fig. 1G) ([111]2, [112]4). For the transition from pro–B
   cells to pre–B cells, we observed an even higher enrichment of
   proliferation pathways ([113]Fig. 1F), associated with pre–B cell
   expansion (large pre-B or FrC′) controlled by the convergence of
   interleukin-7 receptor and the pre-BCR signaling ([114]24). We also
   identified enrichment for Ig heavy-chain and light-chain
   rearrangements, fundamental processes that organize B lymphocyte
   development. Here, we observed an enrichment of Ig-related pathways in
   the transition from CLPs to pro–B cells, a state when Ig rearrangement
   commences ([115]Fig. 1F), and observed increased gene expression for
   RAG1 and RAG2 in pro–B cells and pre–B cells ([116]Fig. 1G). The
   successful Ig rearrangements lead to BCR expression in immature B cells
   and their migration to peripheral blood, which is associated with a
   reduction in proliferation and an increase in migration pathways
   identified during the transition from pre–B cells to immature B cells
   ([117]Fig. 1F).

   With a wealth of knowledge on B cell differentiation derived from
   studies described in mice, we next examined the conservation and
   translatability of our human B cell transcriptional signatures (fig.
   S4) to those equivalent in mice (fig. S5) ([118]19). To achieve this,
   we performed gene set variational analysis (GSVA) to calculate gene set
   activity scores for each human B cell subpopulation signature across
   individual mouse samples. These unitless scores, where higher values
   indicate greater gene set activity, enabled robust cross-species
   transcriptional comparisons. We observed a general concordance between
   species ([119]Fig. 1H), further supporting the evolutionary
   conservation of transcriptional programs across human and mouse B cell
   differentiation and development. In summary, this comprehensive view of
   human B cell differentiation demonstrated well-preserved programs
   across individuals, aligning closely with those described in mice and
   highlighting the highly organized processes subjacent to functional B
   cell generation.

Defining the GRN and TF regulons driving early B cell differentiation

   We next combined chromatin accessibility and gene expression data to
   derive the underlying GRNs driving early B cell differentiation in
   healthy humans. We defined a computational workflow to decipher the GRN
   from the RNA-seq and ATAC-seq coordination (fig. S2B). Briefly, we
   first inferred the CREs (62,559 OCR-gene pairs) and trans-regulatory
   elements (2,299,328 TF-OCR pairs derived from a footprinting analysis).
   Next, we refined the resulting TF-OCR-gene links (1,823,177 links) to
   focus on the regulatory programs driving B cell differentiation by (i)
   retaining only those interactions involving differentially expressed
   genes (DEGs), promoter OCRs, and differentially accessible enhancer
   OCRs across cell state transitions—under the premise that dynamic
   regulatory elements are central to lineage specification—and (ii)
   excluding interactions in which the TF was not differentially expressed
   (n = 469,073), correlations between TF gene expression and target genes
   were not statistically significant (P > 0.05; n = 186,837), or logical
   inconsistencies in correlation directionality were observed
   (n = 117,846). This process resulted in a high-confidence GRN with
   377,268 TF-OCR-gene interactions, involving 169 TFs, 7310 target genes,
   and 16,074 OCRs (table S3 and fig. S6, A to C). Further details are
   provided in Materials and Methods.

   After reconstructing the global GRN architecture, we defined regulons
   as sets of genes associated with a specific TF, inferred through
   TF-OCR-gene interactions. Each regulon represents a functional unit
   within the GRN, consisting of a TF and its co-regulated target genes
   and thereby constituting a coherent regulatory program operating within
   a specific cellular context. The regulon’s target genes were divided
   into two groups on the basis of their correlation with TF activity:
   positively correlated targets, likely activated by the TF, and
   negatively correlated targets, likely repressed by the TF. To explore
   the cell type specificity of these regulons, we computed the regulon
   specificity score (RSS) on the basis of the Jensen-Shannon divergence
   (JSD) for both predicted target regions and target genes ([120]17,
   [121]25). Intuitively, the JSD measures the difference between two
   probability distributions, in this context, how the accessibility or
   gene expression profile of a regulon’s targets in one cell type
   diverges from the average profile across all cell types. As a result, a
   high RSS indicates that the regulon’s activity profile is highly
   distinct and concentrated in a particular cell type, reflecting strong
   regulatory specificity.

   This analysis revealed the high cell type specificity of TF regulons
   for both activated and repressed target genes ([122]Fig. 2, A and B,
   and fig. S7A). Specifically, among the regulons with activating
   functions, we identified five main clusters according to the TF gene
   expression and RSS. Several regulons were associated with HSPCs,
   including TAL1, MEIS1, MECOM, HLF, and GATA2 ([123]Fig. 2A, cyan
   cluster; termed “HSPC regulons”). Others showed enrichment throughout
   differentiation up to the pre–B cell stage, such as GFI1, RUNX1, and
   MYB ([124]Fig. 2A, blue cluster; termed “early B cell regulons”). A
   smaller group characterized pro–B cells, including ELK3 and BCL6B (B
   cell CLL/lymphoma 6B) ([125]Fig. 2A, green cluster; termed “pro–B cell
   regulons”), while well-known factors such as SPI1, IRF4, TCF3, MEF2D
   (myocyte enhancer factor 2D), and EBF1 were mainly enriched in pre–B
   cells and immature B cells ([126]Fig. 2A, pink cluster; termed “pre–B
   cell/immature B cell regulons”). Several regulons associated with late
   differentiation stages, including PAX5, CEBPB, SPIB, etc. ([127]Fig.
   2A, black cluster; termed “late B cell regulons”). In addition, 51 TF
   regulons exhibited repressive activity, including PAX5, EBF1, and TCF3,
   according to previous knowledge ([128]Fig. 2B) ([129]26). To further
   understand the biological and molecular functions of the regulons, we
   performed a pathway enrichment analysis on their target genes
   ([130]Fig. 2C and table S4). We observed a significant enrichment of
   pathways related to B lymphopoiesis and cell cycle across all regulons.
   Early-stage regulons (HSPC and early B cell) showed a significant
   enrichment of pathways involved in cell survival, proliferation,
   differentiation, and migration, with proliferation and survival
   pathways particularly prominent in HSPC regulons. Pre–B cell/immature B
   cell regulons were enriched in pathways related to cellular growth,
   consistent with the generation of large pre–B cells, while late B cell
   regulons were enriched for bioenergetic pathways. Together, our
   multiomics strategy enabled the determination of the GRN and associated
   regulons governing the entire B cell differentiation process,
   recovering a large set of cell type–specific regulons broadly organized
   into five major clusters.

Fig. 2. Regulatory landscape of TF regulons orchestrating early B cell
differentiation.

   [131]Fig. 2.
   [132]Open in a new tab

   (A and B) Heatmap/dot plot showing, for each TF regulon with an
   activator or repressor profile, respectively, the TF gene expression on
   a color scale and the RSS of target regions (OCRs) on a size scale. A
   high RSS score (high dot size) and high TF gene expression (color prone
   to red) indicate that the regulon is highly specific to the
   corresponding cell type. TF regulons are shown by rows, and B cell
   subpopulations are shown by columns. TF regulons are clustered into
   five main groups colored as cyan, blue, green, pink, and black
   (right-side colored bar). (C) Circus plot summarizing the main
   biological pathways enriched for each regulon. On the bottom part, TF
   regulons are shown. TF regulons are clustered according to groups
   identified in (A) (right-side colored bar). On the top part, the
   enriched biological pathways are shown, grouped into five blocks
   describing bioenergetic processes (in pink); cell growth (in purple);
   hematopoietic and B cell lineage and cell cycle (in yellow);
   differentiation, proliferation, survival, and migration (in blue); and
   proliferation and survival (in turquoise). (D) Multidimensional scaling
   representation from the Pearson correlation of csGRNs. Color code by
   cell subpopulation. (E) Heatmap representation of degree [left; total
   number of edges (connections) that a TF has associated] and betweenness
   (right; influence of a given TF over the flow of information in the
   network) centrality measures computed within each csGRN. TFs with the
   highest degree are shown in yellow, and those with the highest
   betweenness are shown in cyan. Each column corresponds to a csGRN, and
   each row corresponds to a TF regulon. TF regulons are clustered
   according to patterns identified in (A) (right-side colored bar). The
   top 10 TF regulons of each csGRN based on degree and betweenness are
   highlighted in red.

Uncovering the master regulators of B cell differentiation through network
topology

   On the basis of the defined regulons, we next performed a network
   topology analysis to provide insights into the master regulators.
   Intuitively, network topology analyses are based on the structural
   organization of the GRN, highlighting key regulators on the basis of
   their position and connectivity within the network ([133]27, [134]28).
   Our topological analysis was based on two measures of network
   centrality: (i) degree, which denotes the total number of edges
   (connections) that a TF has associated, and (ii) betweenness, which
   quantifies the influence of a given TF over the flow of information in
   the network (table S5; see Materials and Methods). As expected, when we
   explored the top 10 ranked TFs by degree, we obtained several TFs
   involved in basic cellular survival and proliferation [several E26
   transformation–specific (ETS) and Krüppel-like factor (KLF) family
   ([135]29), TAF1 (TATA-box binding protein–associated factor 1)
   ([136]30), and E2F3 ([137]31) TFs]. Meanwhile, when we explored the top
   10 ranked TFs on the basis of node betweenness, we recovered critical
   TFs of B cell differentiation such as EBF1, TCF3, IRF4, and BCL11A
   ([138]6, [139]32) and some ETS and KLF family TFs related to B cell
   lymphopoiesis, including ERG (ETS-related gene), KLF13, ELK3, and ELK4
   ([140]33–[141]35).

   Aiming to identify the specific master regulators of each cell
   subpopulation, we derived a specific GRN for each cell subpopulation
   [referred to as a cell type–specific GRN (csGRN); figs. S2B and S6] by
   incorporating chromatin accessibility and gene expression profiles
   within each cell type. For example, the HSPC-specific GRN included only
   interactions where target genes were highly expressed and linked with
   accessible OCRs in HSPC samples.

   An initial visual inspection of the retrieved csGRNs, projected into a
   two-dimensional space on the basis of their pairwise distances,
   revealed that they also recapitulated the differentiation trajectories
   ([142]Fig. 2D). Next, we conducted the topological analysis for each
   csGRN by exploring the centrality degree and betweenness measures
   ([143]Fig. 2E). We observed how the csGRN topology generally mirrored
   the regulon specificity shown in [144]Fig. 2A, as those TFs with low
   expression or chromatin accessibility in a specific cell state were
   excluded from the corresponding csGRN. Among TFs with the highest
   values of degree and betweenness, we recovered the primary TFs
   implicated in B cell commitment, differentiation, maintenance, or Ig
   recombination, which include TCF3, PAX5, IRF4, EBF1, and LEF1
   ([145]Fig. 2E, in red) ([146]1, [147]2). Moreover, this approach
   highlighted the crucial role of IKZF1, SPB1, IRF2, and FLI1 in the
   earliest stages of differentiation, reflected by their high degree and
   betweenness across CLPs, pro–B cells, and pre–B cells, not previously
   observed under the regulon specificity analysis ([148]Fig. 2, A and E).

   Focusing on pro–B cell regulons ([149]Fig. 2, A and E; green cluster),
   we identified ELK3, PPARA (peroxisome proliferator–activated receptor
   α), and BCL11A as putative key regulators in the transition from pro–B
   cell to pre–B cell stage. The essential role of BCL11A in lymphoid
   development has already been reported ([150]32), while the implications
   of PPARA and ELK3 in B cells remain unknown. To note, the ETS-factor
   ELK3 has been associated with pro–B cell expansion in mice ([151]35),
   defined as a regulon in large and small B cells (FrC′/D—pre–B cells) in
   the bursa of Fabricius of young broilers ([152]36), and recently
   described in human B cell differentiation ([153]37). Overall, our
   topological analysis of csGRNs not only aligned with the functional
   cell type specificity of regulons but was also essential in revealing
   the master regulators of B cell differentiation, successfully
   recovering well-known TFs and highlighting putative candidate
   regulators, such as ELK3, that may play a key role in B cell
   lymphopoiesis.

Characterization of the ELK3 regulon showed a putative key role in the
differentiation and proliferation regulation in pro–B cells

   Integrating the TF regulons and comprehensive GRN analysis, we further
   examined the specific regulatory activities of key human B cell TFs
   globally. As an example and positive control of the GRN analysis, we
   evaluated the EBF1-specific regulon (fig. S7, B to E). EBF1 is a
   well-established regulator of B cell differentiation, and among its
   predicted upstream regulators, we identified several known B cell TFs,
   including PAX5, TCF3, FOXO1, SPIB (Spi-1/PU.1–related), RUNX1, and
   CEBPA (fig. S7B), all previously reported to regulate EBF1. In
   addition, within the EBF1 downstream-regulated genes, we also confirmed
   previously established target TFs such as PAX5, TCF3, and FOXO1 (fig.
   S7C) ([154]38). To further unveil the biological functions of the EBF1
   regulon, we performed a pathway enrichment analysis on its activated
   and repressed target genes. This approach identified a significant
   association of the EBF1 regulon–activated genes with B cell activation,
   proliferation, and differentiation pathways and the EBF1
   regulon–repressed genes with metabolic and homeostatic processes (table
   S6). Last, we explored the pathway analysis of the EBF1 regulon for
   each csGRN, resolving pathways significantly enriched in the pro–B
   cell, pre–B cell, immature B cell, transitional B cell, and naïve B
   cell csGRNs (fig. S7, D and E, and table S7). Specifically, this
   analysis highlighted the involvement of EBF1 regulon–activated genes in
   regulating cellular division in pro–B cells and pre–B cells.

   After confirming that our analysis recapitulates the known regulatory
   biology of EBF1, we next assessed the role of a candidate TF, ELK3.
   ELK3 showed high gene expression levels and TF activity in HSPCs, CLPs,
   and pro–B cells ([155]Fig. 3, A and B) and emerged as a regulon within
   pro–B cells from our GRN analysis ([156]Fig. 2). We identified 50 TFs
   as potential regulators of ELK3, which included ERG, STAT5A/B (signal
   transducer and activator of transcription 5A/B), E2F3, ETS2, and ETV6
   (ETS variant transcription factor 6) as activators and SPIB
   (Spi-1/PU.1–related) and POU2F2 as repressors ([157]Fig. 3C). The ELK3
   regulon demonstrated 1339 candidate genes positively regulated and 1294
   candidate genes negatively regulated by ELK3. Within the regulated
   genes, 52 were TFs, including key regulators of B cell differentiation
   such as PAX5, IKZF1, BCL11A, MEF2A, RUNX1, STAT5A, SPI1, and SPIB
   ([158]Fig. 3D). To unveil the biological functions of the ELK3 regulon,
   we performed a pathway enrichment analysis of the activated and
   repressed genes. This approach revealed a significant association of
   the candidate ELK3 regulon–activated genes with the cell cycle
   regulation, including the enrichment of several pathways related to DNA
   replication, unsaturated fatty acid metabolism, and MAPK
   (mitogen-activated protein kinase) and ERK (extracellular
   signal–regulated kinase) cascade (table S8). On the other hand, the
   candidate ELK3 regulon–repressed genes showed a significant enrichment
   of several B cell activation, proliferation, and differentiation
   pathways (table S9). When we assessed the pathway analysis of the ELK3
   regulon for each csGRN-resolved pathway, there was a significant
   enrichment in the HSPCs and pro–B cells. Specifically, it highlighted
   the involvement of ELK3 regulon–activated genes in regulating metabolic
   and cellular processes by the enrichment of the fatty acid metabolic
   process, PI3K (phosphatidylinositol 3-kinase)/PKB (protein kinase B;
   Akt) signal transduction regulation, and MAPK cascade in both HSPCs and
   pro–B cells. A significant enrichment of cell cycle pathways was
   observed in pro–B cells ([159]Fig. 3E and table S10). The ELK3
   regulon–repressed genes showed specific enrichment in pro–B cells,
   associated with several B cell activation, proliferation, and
   differentiation pathways ([160]Fig. 3F and table S10). Overall, our
   findings identify ELK3 as a previously underappreciated TF in B cell
   lymphopoiesis, which may play a crucial role in regulating the cell
   cycle and advancing B cell differentiation during the pro–B cell stage.

Fig. 3. ELK3 regulon characterization.

   [161]Fig. 3.
   [162]Open in a new tab

   (A) Box plot showing the distribution of ELK3 gene expression and (B)
   ELK3 TFBS activity score by cell subpopulation. Significant gene
   expression changes between cell subpopulation transitions assessed by
   the Wilcoxon test are shown. Significant changes were defined as P <
   0.05 (NS., no significant; *P = 0.05, **P = 0.01, ***P = 0.001). (C and
   D) Heatmap representation of gene expression for TF regulators of ELK3
   and TF regulated by ELK3, respectively. Samples are shown by columns
   (grouped by cell subpopulation), and TFs are shown by rows. Normalized
   gene expression levels are represented. On the right side, ρ values of
   Spearman correlation for each TF versus ELK3 are shown. On top, a bar
   plot representation of the ELK3 gene expression is shown. (E)
   Representation of the main enriched GO terms derived from the pathway
   analysis of the ELK3 regulon–activated genes for each csGRN. The
   adjusted P value is shown in a red-blue color range, and the number of
   genes enriched within the term (counts) is shown by dot size. (F)
   Representation of the main enriched GO terms derived from the pathway
   analysis of the ELK3 regulon–repressed genes for each csGRN.
   Significant terms only appeared in the assessment of csGRN derived from
   pro–B cells. For the pro–B cell csGRN enriched terms, the adjusted P
   value (red-blue color range), gene ratio, and counts (dot size) are
   shown.

Single-cell translation fine-tunes and validates bulk-derived regulons
dynamics

   To gain a finer-grained understanding of the transcriptional regulation
   of the early B cell differentiation, we analyzed the bulk-based defined
   regulons at the single-cell level. To this end, we interrogated two
   single-cell datasets: a public dataset including healthy human bone
   marrow cells from hematopoietic stem progenitor cells (HSPCs) to mature
   B cells and plasma cells from five donors (phs002371.v7.2; [163]Fig.
   4A) ([164]39) and an internally generated dataset including bone marrow
   CD34+ hematopoietic progenitors isolated from five healthy donors
   [in-house dataset; mean age, 22.4 (21 to 24)] ([165]Fig. 4B; figs. S1B
   and S8, A to I; and table S1). This approach allowed us to explore the
   specific TF expression and the activity of regulons over the entire B
   cell differentiation, including those cellular states not currently
   accessible by flow cytometry–based isolation.

Fig. 4. Translation of the early B cell differentiation regulatory landscape
to a single-cell resolution.

   [166]Fig. 4.
   [167]Open in a new tab

   (A) scRNA-seq UMAP dimensionality reduction of 10,529 human bone marrow
   cells, recovering the B cell lymphopoiesis (public dataset
   phs002371.v7.2). (B) scRNA-seq UMAP dimensionality reduction of 28,891
   human CD34+ cells, drawing the early B cell lymphopoiesis (in-house
   dataset). (C and D) Heatmap showing, for each cell (columns; grouped by
   cell subpopulation), the estimated gene set activity score of each TF
   regulon–activated gene set by calculating the AUCell score (rows) for
   the public and in-house datasets, respectively. Row-scaled AUCell score
   values are represented. TF regulons are clustered and colored according
   to TF regulon groups identified in [168]Fig. 2A (left-side colored
   bar). (E and F) Dot plot showing, for each cell subpopulation, the
   average gene expression and the percentage of expressing cells for
   PPARA, ELK3, and NR3C1 (TF regulons identified in [169]Fig. 2 as pro–B
   cell–type specifics) and MYBL2, FOXM1, and E2F7 (cell cycle–related
   TFs) in the public and in-house datasets, respectively. The average
   gene expression is shown in a color range, and the percentage of
   expressing cells is indicated by dot size. (G to I) Box plot
   representation of the AUCell gene set activity score for the MYBL2,
   FOXM1, E2F7, ELK3, NR3C1, and PPARA TF regulon–activated gene sets by
   cell subpopulation for public (top) and in-house (bottom) datasets.

   We estimated the regulon activity per cell and per regulon by
   calculating the AUCell score for each TF regulon–activated gene set.
   Briefly, AUCell scores gene set activity in single cells by ranking
   expression and calculating the AUC of gene enrichment per cell.
   Therefore, we quantified the enrichment of the positively regulated
   genes of each regulon within all expressed genes for each cell. In both
   datasets, the resulting regulon enrichment patterns closely mirrored
   those identified in bulk analysis ([170]Figs. 2A and [171]4, C and D),
   showing a moderate-to-strong correlation in more than 79% of regulons
   (fig. S9). Despite the similarity, the granularity of the single-cell
   data uncovered an increased enrichment of the regulons grouped in the
   blue cluster with cycling cells according to one of the main biological
   pathways associated with this cluster (cellular proliferation; see
   [172]Fig. 2C). Within this group of TFs, we identified MYBL2, FOXM1,
   and E2F7 showing elevated gene expression and regulon activity in
   cycling cells ([173]Fig. 4, E to G).

   In addition, we observed the increased gene expression of ELK3 and
   NR3C1 in pro–B cells ([174]Fig. 4, E and F) along with a refined
   resolution of their regulon activity profiles ([175]Fig. 4H). The NR3C1
   regulon, which was highly active in pro–B cells and pre–B cells
   ([176]Fig. 2A), showed the highest activity in cycling pre–B cells. In
   the case of the ELK3 regulon, which was highly active throughout the
   early stages of B cell differentiation—from hematopoietic stem cells
   (HSCs) to pro–B cells—and significantly decreased during the transition
   from pro–B cells to pre–B cells ([177]Fig. 2A), a delayed reduction in
   activity was observed, with the most notable drop occurring between
   cycling pre–B cells and pre–B cells ([178]Fig. 4H). PPARA displayed a
   similar pattern of regulon activity to ELK3 ([179]Figs. 2, A and E, and
   [180]4I); however, its gene expression was slightly detected in CLPs
   and pre-pro–B cells and was largely absent in the other cell
   subpopulations ([181]Fig. 4, E and F). In summary, the single-cell
   translation not only validated the cell type specificity of regulons
   but also provided high-resolution insights into their activity dynamics
   throughout B cell differentiation.

Identification of relevant B cell subpopulations and regulons within B-ALL
subtypes

   Given the comprehensive regulatory landscape of our human B cell
   profiling, we next turned our attention to the translatability of our
   results within a pathologic context. We hypothesized that the accurate
   characterization of the B cell differentiation would provide insights
   into the processes underlying the dysregulation of precursor B cell
   differentiation–related diseases, such as B-ALL.

   To investigate our hypothesis, we obtained a large public dataset
   including several B-ALL subtypes from both child and adult individuals
   (Li dataset; [182]Fig. 5A and fig. S10, A to C) ([183]40). Next, we
   aimed to investigate the distribution of each TF regulon activity
   (focused on activated genes and using GSVA) in each individual sample.
   Investigating the derived heatmap, we did not identify a perfect
   association between regulatory programs and the different genetic
   subtypes ([184]Fig. 5B). However, we found consistently increased
   activity scores in patients with B-ALL from the TCF3-PBX1 (pre–B cell
   leukemia homeobox 1) and MEF2D fusion subtypes within pre–B
   cell/immature B cell regulons (pink cluster). Also, within patients
   from the ETV6/RUNX1 subtype, a subtle increased activity was observed
   for some of the regulons classified as pro–B cell regulons (green
   cluster), including OTX1 (orthodenticle homeobox 1), IRX2 (iroquois
   homeobox 2), and AIRE (autoimmune regulator) (fig. S11).

Fig. 5. Uncovering B-ALL through the normal B cell differentiation regulatory
landscape.

   [185]Fig. 5.
   [186]Open in a new tab

   (A) Schematic representation of the B-ALL public Li dataset explored,
   including 469 individuals (childhood and adult) from three different
   cohorts. The PCA of gene expression is shown. Each dot represents a
   sample colored by B-ALL genetic subtype. (B) Heatmap representation of
   the gene set activity score (GSVA) of each TF regulon–activated gene
   set (rows) computed for each B-ALL sample (n = 469; in columns).
   Row-scaled GSVA score values are represented. TF regulons are clustered
   and colored according to TF regulon groups identified in [187]Fig. 2A
   (left-side colored bar). Patients with B-ALL are clustered according to
   B-ALL subtype (top colored bar). The sample age (adult versus
   pediatric) is shown for each sample (top colored bar). (C) ORA of B-ALL
   subtype signatures over the B cell subpopulation signatures. The dot
   size represents the −log[10](adjusted P value), and the color range
   represents the gene ratio. (D) ORA for RUNX1-ETV6 high-affinity binding
   sites (n = 711) over the consensus OCRs for each B cell subpopulation.
   The dot size represents the −log[10](adjusted P value), and the color
   represents the odds ratio. (E) ORA of B-ALL subtype signatures (rows)
   over the TF regulon–activated gene set (columns) defined by the bulk
   approach. The dot size represents the −log[10](adjusted P value), and
   the color range represents the gene ratio. Only TFs with at least one
   significant association are shown. TF regulons clustered and colored
   according to TF regulon groups identified in [188]Fig. 2A (top-side
   colored bar). (F to I) Box plot showing the distribution of the gene
   set activity score (GSVA) computed on the TF regulon–activated gene set
   for TCF3, MEF2D, FOXD3, SPI1, OTX1, IRX2, STAT5B, and FOSB TF regulons
   by B-ALL subtype. This analysis is shown for the Li dataset and two
   additional datasets for validation [Rainer and Iacobucci (Iaco.)], when
   possible.

   To further investigate those initial observations, we first aimed to
   compare what defines transcriptional B-ALL subtypes compared to those
   of normal BCPs. To this end, we analyzed the enrichment of the
   transcriptional signatures for each of the B-ALL subtypes (fig. S10, D
   and E, and table S12) and transcriptional signatures of normal BCPs
   (fig. S4) using an overrepresentation analysis (ORA). Our analysis
   confirmed a statistically significant association of MEF2D fusions and
   TCF3-PBX1 ALL with pre–B cells and immature B cells, ETV6-RUNX1/like
   enriched in pro–B cells, and pointed to a significant association of
   BCR-ABL1 (Abelson tyrosine-protein kinase 1) with pro–B cells, ZNF348
   (zinc finger and BTB domain containing 33) fusions with HSPCs, and the
   KMT2A (lysine methyltransferase 2A) fusion subtype with HSPCs/CLPs
   ([189]Fig. 5C). Moreover, when considering the chromatin accessibility
   data, we explored the enrichment of a set of 711 high-affinity binding
   sites for the ETV6-RUNX1 fusion defined by Wray et al. ([190]41) in our
   chromatin accessibility data, which confirmed again the significant
   association of the ETV6-RUNX1 ALL subtype with the pro–B cell state
   ([191]Fig. 5D).

   Once we observed significant associations between B-ALL subtypes and
   normal BCPs, we focused on identifying significant associations at the
   TF regulon level. To assess this, we calculated the enrichment of the
   transcriptional signatures of B-ALL subtypes in the activated target
   genes of regulons using the ORA. Our analysis uncovered specific TF
   regulons enriched in some of the B-ALL subtypes (see all regulons in
   [192]Fig. 5E). Some of the identified regulons were validated using two
   independent B-ALL public datasets: one including childhood ALL with
   different B-ALL subtypes (Rainer dataset, [193]GSE73578) ([194]42) and
   another including adult B-ALL with or without BCR-ABL1 mutation
   (Iacobucci dataset, [195]GSE13204) (fig. S10, F to I) ([196]43). Among
   those confirmed, we observed the expected elevated activity of the TCF3
   regulon in the TCF3-PBX1 subtype and the MEF2D regulon in the MEF2D
   fusion subtype, which was also increased in the TCF3-PBX1 subtype
   ([197]Fig. 5F). In addition, we validated an increased activity of
   FOXD3 and SPI1 regulons in both subtypes, among others ([198]Fig. 5G).
   Although not significant in the ORA ([199]Fig. 5E), we identified an
   increased activity of OTX1 and IRX2 regulons—both associated with the
   pro–B cell state; see [200]Fig. 2—in the patients with ETV6/RUNX1 ALL
   ([201]Fig. 5H). Last, in the BCR-ABL1/ph–like ALL subtype, we detected
   a significantly increased activity score for STAT5B and
   FOSB—characteristics of early progenitor cells (cluster in
   cyan)—regulons with a particularly strong increase in the ZNF384 fusion
   subtype ([202]Fig. 5I). In summary, by translating our comprehensive
   resource to the B-ALL context, we illustrated their potential to obtain
   a deeper understanding of the molecular underpinnings of B cell
   malignancies.

DISCUSSION

   We addressed the challenge of profiling gene expression and chromatin
   accessibility during the early stages of human B cell lymphopoiesis to
   elucidate the underlying GRN that orchestrates the differentiation
   process. To this end, to our knowledge, we compiled the most
   comprehensive sample collection to date, encompassing bulk-derived
   multiomics profiles of eight BCPs from 13 healthy donors
   (EGAS00001007296), publicly available for browsing at
   [203]https://translationalbio.shinyapps.io/brex/. This dataset,
   characterized by unprecedented granularity, provided the statistical
   power necessary for a detailed characterization of transcriptional and
   chromatin accessibility dynamics, allowing us to move beyond the
   transcription-based understanding of the gene regulatory landscape.

   The definition of the BCP subpopulations has been well established in
   mice—the Hardy fractions—([204]44); however, its translation to human
   surface markers remains controversial, and an unequivocal assignment is
   missing ([205]45). Here, we defined a sorting strategy that allowed the
   recovery of the human cell fractions equivalent to HSPCs, CLPs, pro–B
   cells (including FrB/C), pre–B cells (including FrC′/D), immature B
   cells (FrE), translational B cells (FrD), and naïve B cells (FrD)
   ([206]Fig. 1, A and H), providing an unparalleled resource to robustly
   understand the system and complete what has been previously achieved in
   mice or mature stages of human B cells ([207]3, [208]19, [209]46,
   [210]47). Although the defined strategy was accurate, we would like to
   note that, because of sorting limitations, the CLP fraction likely also
   contains early lymphoid progenitor cells (CD34^+CD38^+CD45RA^+CD133^+).

   Both gene expression and chromatin accessibility profiles delineate
   early B cell differentiation and recovered well-known biomarkers,
   biological processes, and key TFs ([211]Fig. 1). Chromatin
   accessibility and, more specifically, intronic and distal intergenic
   regions yield sharper distinctions between cell subpopulations than
   promoter regions or expression profiles (figs. S2A and S12, A and B),
   consistent with previous observations in the mouse immune system and
   human hematopoiesis ([212]19, [213]48). This increased variability
   supports the crucial role of enhancer elements in the B cell
   development regulation (fig. S12C).

   Given that gene expression is regulated by chromatin accessibility, in
   combination with other epigenetic factors ([214]49), we designed a
   custom joint analysis—drawing inspiration from a previous study using
   mouse bulk data ([215]19)—to reconstruct the GRN underlying early B
   cell differentiation (fig. S2). Although the computational approach
   involves several assumptions and may yield some false positive edges,
   the resulting GRN serves as a robust and valuable resource for
   hypothesis generation. This is supported by the following: (i) We used
   significant “OCR-gene” pairs derived from a large number of paired
   samples; (ii) we used footprinting, enabled by deep sequencing of
   ATAC-seq (more than 100 million reads per sample), to improve the
   estimation of active TF binding sites; (iii) the GRN recovers the
   already well-established regulators; and (iv) a specific GRN for each
   cell state can be derived, permitting us to investigate further the
   cell type specificity of the regulons ([216]Fig. 2). Moreover, we
   leveraged public datasets to confirm that the defined CREs are enriched
   by active histone marks ([217]50) and located within A-type
   compartments of chromatin conformation (fig. S13, A and B) ([218]51).
   Also, using chromatin conformation data from naïve B cells ([219]52),
   we validated several CREs associated with characteristic genes from the
   late stages of differentiation, including MS4A1 (CD20), CD79A/B, PAX5,
   and IKZF1 (fig. S13C). However, Hi-C or promoter Hi-C experiments
   covering the early B cell differentiation—currently unavailable—will be
   essential for refining the defined GRN and allowing further
   interrogation of the cis-regulatory architecture, possibly leveraging
   deep learning models. Also, chromatin immunoprecipitation sequencing
   (ChIP-seq) experiments can help to validate the regulons identified.

   The recovered GRN not only captured well-known regulators extensively
   studied in mice but also validated recently described genes
   contributing to human lymphoid lineage differentiation, such as MLXIP
   ([220]53). In addition, it provided a valuable classification of highly
   cell type–specific regulons and enabled the identification of master
   regulators, shedding light on understanding B cell differentiation and
   the transition to malignancy. Our GRN identified ELK3 as a putative
   candidate regulatory element of the early stages of B cell
   differentiation ([221]Fig. 2, A, C, and E), potentially involved in
   cell cycle progression mainly in pro–B cells ([222]Fig. 3). This TF has
   recently been defined as a regulon in large and small B cells
   (FrC′/D—pre–B cells) in the bursa of Fabricius of young broilers
   ([223]36), and it was associated with pro–B cell expansion in mice
   ([224]35). Furthermore, evidence supporting a role for ELK3 in human B
   cell differentiation has been reported recently ([225]37). Moreover,
   ELK3 has been described as a novel oncogene reported to be increased in
   several carcinogenic processes, including the ETV6-RUNX1–positive ALL,
   a kind of ALL that resembles the pro–B cell state and shows the
   down-regulation of the G[2]-M checkpoint pathway ([226]22,
   [227]54–[228]56). ELK3 inhibition has been revealed as a potential
   candidate therapy for tumor growth and metastasis in mice and prostate
   cancer cells ([229]57–[230]59). Further functional studies are needed
   to better understand the regulatory role of ELK3 in both healthy and
   disease contexts.

   To achieve a higher cellular resolution, which is not feasible with
   sorting strategies, we projected the defined bulk-derived regulons onto
   single-cell data. Analysis of two independent datasets further
   validated and refined the cell type specificity of TF regulons, which
   included the regulatory control of ELK3. Notably, MYBL2, FOXM1, and
   E2F7 regulons were identified as highly active in cycling cells,
   consistent with their known roles in regulating cell proliferation
   ([231]47, [232]60, [233]61). When we inferred regulons directly from
   the multiomics single-cell dataset (fig. S8, A to C, J, and K), we
   observed only 62 regulons (table S11), of which 34 were considered
   significant (21 activators, ρ > 0.7; 13 repressors, ρ < −0.70). Most of
   the activator regulons identified (17 of 21) matched those from our
   bulk approach, including HLF, MECOM, MEIS1, NFE2, PBX1, RUNX1/2, CEBPA,
   THRB, LYL1, FOS, FOSB, MYBL2, LEF1, TCF4, EBF1, and PAX5. However,
   despite the relevant overlap between both approaches, it became clear
   that single-cell data did not recapitulate the broader set of regulons
   identified through footprinting-based bulk analysis. This comparison
   underscores the need for integrative strategies that harness the depth
   of bulk data and the resolution of single-cell data to comprehensively
   define GRNs.

   Last, the translational relevance of our resource was demonstrated by
   leveraging the comprehensively defined regulatory landscape of early
   human B cell lymphopoiesis to study B-ALL malignancy. This revealed
   substantial heterogeneity in regulatory programs across B-ALL subtypes,
   with distinct associations between specific early B cell states and
   certain genetic subtypes, including MEF2D fusions, TCF3-PBX1, and
   ETV6-RUNX1/like. These findings support the concept of heterogeneous
   origins of B-ALL transformation and reflect the cell of origin or
   cellular state of malignant transformation, as recently described by
   Iacobucci et al. ([234]23) ZNF384 fusions were associated with HSPCs,
   potentially explaining the preference for allogeneic hematopoietic stem
   cell transplantation in this subtype, as well as the enrichment of
   HSC-like features in TCF3-ZNF384–positive patients ([235]62, [236]63).
   Similarly, KMT2A fusions showed a strong association with HSPCs and
   HSPC-associated regulons ([237]Fig. 5, C and E), supporting previously
   reported increases in lineage plasticity ([238]39). In addition,
   specific regulatory programs were identified and validated for various
   B-ALL subtypes, including the well-known increased activity of MEF2D
   and TCF3 in MEF2D fusions and TCF3-PBX1 subtypes, respectively
   ([239]64). We also identified candidate regulatory targets in other
   B-ALL subtypes, such as OTX1 and IRX2 in ETV6-RUNX1/like B-ALL. IRX2
   has also been described as aberrantly expressed in other B-ALL
   subtypes, such as BCR/ABL1 and hyperdiploid karyotypes ([240]65).
   Moreover, OTX homeobox genes have been previously associated with B
   cell differentiation in Hodgkin lymphoma ([241]66).

   Beyond these biological insights, our work introduces a potential
   strategy for patient stratification based on active regulatory
   programs, either as an alternative to or in combination with genetic
   classification. Patient-specific regulon activity profiles could offer
   critical insights into treatment response and enable more personalized
   therapeutic approaches, supporting the recently uncovered correlation
   between developmental states and asparaginase response in B-cell ALL
   ([242]67). To briefly explore this concept, we performed a preliminary
   exploration of regulon activity in relation to treatment response using
   two public datasets: one evaluating CART19 (CD19-targeted chimeric
   antigen receptor T cell) therapy ([243]68) and another assessing the
   response to SMAC (second mitochondria-derived activator of caspase)
   mimetics ([244]69). Preliminary results suggest that nonresponders
   exhibit higher activity of stemness and proliferative-related regulons
   compared to responders (fig. S14).

   Last, it is important to acknowledge that, despite the high resolution
   achieved and the promising translatability for understanding underlying
   mechanisms in malignant contexts, much of the initial analysis relied
   on predefined early B cell subtypes. While single-cell mapping improved
   the dynamic resolution of regulons, their definition and associated
   GRNs remained largely confined to established cell populations.
   Consequently, part of the study is still limited to known cell types,
   and the existence of additional, yet uncharacterized, subpopulations
   cannot be ruled out. In this context, future CITE-seq (cellular
   indexing of transcriptomes and epitopes by sequencing) experiments may
   provide deeper insights by enabling more refined discrimination and
   isolation of cell types ([245]70). In summary, the generation of a
   comprehensive robust atlas of regulatory elements governing early human
   B cell differentiation provides a powerful resource to deeply
   interrogate the GRNs and epigenetic components that distinguish healthy
   and disease cell states.

MATERIALS AND METHODS

   All data and codes used in the project can be found on GitHub
   ([246]https://github.com/TranslationalBioinformaticsUnit/bcell) and
   Open Science Framework (OSF; [247]https://osf.io/gswpy/).

Human sample collection

   Human bone marrow and peripheral blood samples from the same
   individuals were collected for 13 healthy donors [7 males and 6
   females; median age, 21 (20 to 23)] (table S1). Samples and data from
   donors included in the study were provided by the Biobank of the
   University of Navarra after informed consent was obtained and were
   processed following standard operating procedures. The Clinical
   Research Ethics Committee of the Clinica Universidad de Navarra
   approved the study (ethical approval 2021.035).

FACS sorting

   Mononuclear cells were purified from whole blood and bone marrow
   samples using Ficoll-Paque PLUS (GE Healthcare, IL). After that, eight
   immune cell populations, representing the lymphoid hematopoietic
   differentiation cascade along the B cell lineage from adult human bone
   marrow precursors and peripheral blood, were obtained from each sample.
   These cells were purified following the sorting strategy shown in table
   S1 and fig. S1A in a BD FACSAria II (BD Biosciences) and then analyzed
   by bulk RNA-seq and bulk ATAC-seq.

RNA-seq data generation

   Bulk RNA-seq was performed following the MARS-seq (massively parallel
   RNA single-cell sequencing framework) protocol adapted for bulk RNA-seq
   ([248]71, [249]72) with minor modifications. Briefly, cells were sorted
   in 100 μl of Lysis/Binding Buffer (Ambion), vortexed, and stored at
   −80°C until further processing. Poly-A RNA was selected with Dynabeads
   Oligo (dT) (Ambion) and reverse transcribed with AffinityScript
   Multiple Temperature Reverse Transcriptase (Agilent) using poly-dT
   oligos carrying a 7–nucleotide (nt) index. Up to eight samples with
   similar overall RNA content were pooled and subjected to linear
   amplification by in vitro transcription using the HiScribe T7 High
   Yield RNA Synthesis Kit (New England Biolabs). Next, RNA was fragmented
   into 250- to 350-nt fragments with RNA Fragmentation Reagents (Ambion)
   and dephosphorylated with FastAP (Thermo Fisher Scientific) following
   the manufacturer’s instructions. Next, partial Illumina adaptor
   sequences were ligated with T4 RNA Ligase 1 (New England Biolabs),
   followed by a second reverse transcription. Full Illumina adaptor
   sequences were added during final library amplification with KAPA HiFi
   DNA Polymerase (Kapa Biosystems). RNA-seq library quality controls
   consisted of quantification with a Qubit 3.0 Fluorometer (Life
   Technologies) and size profile examination using Agilent’s 4200
   TapeStation System. Libraries were sequenced in an Illumina NextSeq 500
   in a single end at a sequence depth of 10 million reads per sample.

RNA-seq quantification and normalization

   Read quality was assessed with FastQC software (Babraham Institute).
   Low-quality, short-read, and TruSeq3-SE adapters were trimmed using
   Trimmomatic ([250]73) under the following parameters:
   ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15
   MINLEN:30. Reads were then mapped to the December 2013 Homo sapiens
   high-coverage assembly GRCh38 from the Genome Reference Consortium
   GRCh38 (hg38) using STAR (version 2.6.0a) ([251]74) and counted for
   each gene using htseq-count ([252]75) with the –m intersection-nonempty
   option and a GTF file from GRCh38 Ensembl annotation release 96,
   getting the information for a total of 58,884 genes. Genes with a
   minimum read count of 1 in all population replicates and an associated
   ensemble ID were retained (23,454 genes). A sample identified as an
   outlier has been removed (sample ID: D213-4). Counts were normalized
   under trimmed mean of M-values (TMM) normalization and voom
   transformation from the edgeR r-package (version 3.24.3) ([253]76).
   Correlation between samples was computed using Pearson correlation
   (fig. S10A).

RNA-seq data analysis

   PCA was used for data exploration ([254]Fig. 1B). Differential
   expression analysis was performed using limma (version 3.38.3) and
   edgeR (version 3.24.3) under two different approaches: (i) First is
   identifying biomarker genes for each cell subpopulation (all possible
   pair combinations). Genes with an adjusted P value ≤0.01
   (Benjamini-Hochberg) and fold change ≥1.5 at least in four contrasts
   were defined as significant (B cell subpopulation biomarkers; fig. S4).
   (ii) Second is identifying DEGs between transition states (HSC versus
   CLP, CLP versus pro–B cell, pro–B cell versus pre–B cell, etc.). Genes
   with an adjusted P value ≤0.05 (Benjamini-Hochberg) and absolute fold
   change ≥1.5 were considered significant. A clustering analysis was
   performed to identify the different expression patterns within the DEGs
   over transitions (second approach) using a model-base clustering based
   on parameterized finite Gaussian mixture models [mclust r-package
   (version 5.4.5) under the following function and parameter: Mclust
   (modelName = “EII”)] (fig. S6A) ([255]77). Gene set enrichment analysis
   of Gene Ontology (GO) terms was performed using the ClusterProfiler
   r-package (version 4.6.0) ([256]78). The fold change of DEGs at
   different transitions was given as input to the compareCluster
   function, and the top 10 enriched terms were plotted using an in-house
   treeplot function ([257]Fig. 1F).

ATAC-seq data generation

   Accessible chromatin mapping was performed using FAST-ATAC-seq
   ([258]48) with minor modifications. Briefly, 10,000 cells were sorted
   in 1× phosphate-buffered saline (PBS) and 0.05% bovine serum albumin
   (BSA) and pelleted by centrifugation at 500 rcf for 5 min at 4°C with
   low acceleration and brake settings in a precooled swinging-bucket
   rotor centrifuge. All the supernatant but 5 μl was removed. Next, 25 μl
   of transposase mix [15 μl of 2× TD buffer (Illumina), 1 μl of TDE1
   (Illumina), 0.25 μl of 5% digitonin (Promega), and 8.75 μl of
   nuclease-free water] was added to the cells, and the pellet was
   disrupted by pipetting. Transposition reactions were incubated at 37°C
   for 30 min in an Eppendorf ThermoMixer with shaking at 450 rpm.
   Reactions were stopped at 4°C for 5 min. To release tagmented DNA,
   samples were incubated at 40°C for 30 min with 5 μl of cleanup buffer
   [900 mM NaCl (Sigma-Aldrich), 30 mM EDTA (Millipore), 2 μl of 5% SDS
   (Millipore), and 2 μl of proteinase K (NEB)]. DNA was purified using a
   2× SPRI beads cleanup kit (Agencourt AMPure XP, Beckman Coulter). Two
   sequential polymerase chain reactions (PCRs) were performed using KAPA
   HiFi DNA Polymerase and customized Nextera PCR indexing primers (IDT)
   ([259]79). The conditions of the first PCR were as follows: 5 min at
   72°C and 2 min at 98°C followed by nine cycles of 20 s at 98°C, 30 s at
   63°C, and 1 min at 72°C. Depending on the library concentration in this
   first PCR, a second PCR (2 min at 98°C followed by four to six cycles
   of 20 s at 98°C, 30 s at 63°C, and 1 min at 72°C) was performed, aiming
   for a library concentration in the range of 2 to 10 ng/μl. PCR products
   were purified using a 2× SPRI beads cleanup kit. Libraries were
   quantified, and their size profiles were examined as described above.
   Sequencing was carried out in two batches: the first one in an Illumina
   NextSeq 500 using paired-end, dual-index sequencing (Read1: 38 cycles;
   Read2: 38 cycles; i7: 8 cycles; i5: 8 cycles) at a depth of 10 million
   reads per sample and second one in a HiSeq2500, v4 reagents using
   paired-end, dual-index sequencing (Read1: 125 cycles; Read2: 125
   cycles; i7: 8 cycles; i5: 8 cycles) at a depth of 100 million reads per
   sample.

ATAC-seq quantification and normalization

   Read quality was assessed with FastQC software (Babraham Institute).
   After trimming NexteraPE-PE adaptor sequences and low-quality reads and
   cutting the reads from the second batch to 38 base pairs (bp) using
   Trimmomatic ([260]73) under the following parameters: CROP:38
   ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3
   SLIDINGWINDOW:4:15 MINLEN:15. Reads were then mapped to the hg38
   reference genome (GRCh38 Ensembl annotation release 96) using BWA-MEM
   ([261]80). Unmapped, unknown, random, ChrM mapping, duplicate reads and
   reads matching the black list [the blacklist includes problematic
   regions of the genome ([262]81)] were filtered out using samtools view
   -b -F 4 -F 1024 (samtools version 0.1.19) and Picard Tools (Picard
   MarkDuplicates; [263]http://broadinstitute.github.io/picard). BAM files
   from resequenced samples were merged using samtools merge. Paired-end
   reads were used to determine those regions with significant read
   coverage in contrast to the random background (regions known as peaks
   or OCRs) using MACS2 (function callpeak) with the following parameters:
   -g hs -f BAMPE -q 0.01 --nomodel --shift -100 --extsize 200 -B –SPMR
   ([264]https://github.com/taoliu/MACS) ([265]82). A minimum OCR size of
   20 bp was set as a threshold. To get the consensus OCRs for each cell
   subpopulation, we found the genome regions that are covered by at least
   n − 1 of the sets of peaks (n being the total number of samples for a
   specific cell subpopulation), and then we merged the nearby peaks with
   a minimum gap width of 31. To get the consensus OCRs in the whole
   population, we get the regions in the genome where the coverage is at
   least in 1 of the consensus OCRs from the different cell subpopulations
   and then merged the nearby peaks with a minimum gap width of 31,
   resulting in 105,530 ATAC-seq OCRs. These processes were done using the
   rtracklayer (version 1.58.0) and GenomicRanges (version 1.50.2)
   r-packages ([266]83, [267]84).

   BigWig files were generated for each BAM file using the bamCoverage
   function (--binSize 20 --normalizeUsing RPKM --effectiveGenomeSize
   2913022398) from deepTools (version 3.3.1) ([268]85). Then, a merged
   BigWig for each cell subpopulation was obtained through bigWigMerge
   from UCSC
   ([269]http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/). Also,
   BigBed files were generated from consensus BED files of each cell
   subpopulation using bedToBigBed from UCSC. These data were deposited in
   a public custom track from UCSC
   ([270]https://genome.ucsc.edu/s/TransBioLab/hg38_Brex).

   Using the whole consensus OCRs, we compute the count matrix using the
   multicov function from bedtools (version 2.25.0) ([271]86). Those OCRs
   without a minimum of 10 counts for at least one sample and 15 total
   counts were excluded (155 OCRs), retaining 105,359. Then, we normalize
   the data using the TMM method from the edgeR r-package (version 3.30.3)
   ([272]76). One of the samples analyzed was identified as an outlier and
   excluded (sample ID: D215-5). Correlation between samples was computed
   using Pearson correlation (fig. S10B).

ATAC-seq data analysis

   PCA was used for data exploration ([273]Fig. 1C). OCRs were mapped to
   genomic locations (promoter, intronic, distal intergenic, etc.) using
   the ChIPseeker r-package (version 1.18.0) ([274]87) under the following
   arguments: region range of transcription start site (TSS) as +/− 1000
   bp, gene level, “TxDb.Hsapiens.UCSC.hg38.knownGene” TxDb object
   (version 3.4.0) and “org.Hs.eg.db” annotation package (version 3.7.0)
   (fig. S2A). The Gini index was computed for each OCR to measure
   “chromatin inequality” (fig. S10B). A Gini index of 0 represents
   perfect equality and similar chromatin accessibility in all cell
   subpopulations, while a Gini index of 1 represents perfect inequality,
   all the accessibility OCRs come from a specific cell subpopulation, and
   the others show no accessibility. To compute the Gini index, the
   maximum chromatin accessibility observed from the normalized count
   matrix was assigned for each cell subpopulation. Then, the gini
   function from the REAT r-package (version 3.0.2) was applied ([275]88).
   Differential accessibility analysis was performed to identify
   differential accessibility regions between transition states (HSC
   versus CLP, CLP versus pro–B cell, pro–B cell versus pre–B cell, etc.)
   using a quasilikelihood negative binomial generalized log-linear model
   (glmQLFit) from edgeR (version 3.24.3) ([276]76). OCRs with an adjusted
   P value ≤0.01 (Benjamini-Hochberg) and absolute fold change ≥2 were
   defined as significant. A clustering analysis was performed to identify
   the different accessibility patterns within the differentially
   accessible regions (DARs) under a model-base clustering based on
   parameterized finite Gaussian mixture models [mclust r-package (version
   5.4.5) under the following function and parameter:
   Mclust(modelName = “EII”)] (fig. S6B) ([277]77). To explore the
   accessibility of TFs, we computed the aggregated motif score (TFBS
   accessibility score) for each TF using the chromVAR r-package (version
   1.5.0) ([278]89) with the default parameters. For plotting, the TFBS
   score was used ([279]Fig. 1E and fig. S3).

Gene set variation analysis (GSVA)

   We performed the GSVA to compute the enrichment of specific gene sets
   within each individual sample. We applied GSVA to explore the
   transcriptional proximity of different B cell states between humans and
   mice ([280]GSE109125) ([281]Fig. 1H) ([282]19). To this end, we used
   the defined biomarker gene sets for each B cell subpopulation in humans
   (fig. S4) and computed the GSVA enrichment scores using the algorithm
   of Hänzelmann et al. ([283]90) by calling the gsva() function from the
   GSVA R package (version 1.50.5). We also used GSVA to assess the gene
   set enrichment of TF regulons in each B-ALL sample. We computed the
   GSVA of the TF regulons within the datasets of Li, Rainer, Iacobucci,
   Singh, and Zinngrebe ([284]Fig. 5, B and F to I, and fig. S12). To
   explore significant differences at the TF regulon level regarding
   treatment response, we used the GSVA results from the Singh and
   Zinngrebe datasets as input and performed differential analysis using
   limma (version 3.58.1) and sva (version 3.50.0) to adjust for sample
   heterogeneity using surrogate variable analysis ([285]91). TF regulons
   with an adjusted P value ≤0.05 (Benjamini-Hochberg) were defined as
   significant (fig. S12).

OCR variance component analysis

   We used a linear mixed model approach for assessing the proportion of
   variance in gene expression explained by promoter and distal enhancer
   OCRs’ activity (fig. S10C). For each gene, we fit the following linear
   mixed model regression
   [MATH: <mrow><mi mathvariant="normal">Y</mi><mo>=</mo><mi
   mathvariant="normal">U</mi><mo>_</mo><mtext>promoter</mtext><mo>+</mo><
   mi
   mathvariant="normal">U</mi><mo>_</mo><mtext>enhancers</mtext><mo>+</mo>
   <mtext>Epsilon</mtext></mrow> :MATH]

   where Y is a vector of zero-centered gene expression values, U_promoter
   and U_enhancer are random effects, and Epsilon explains noise
   unaccounted by the model. The limix python library was used for fitting
   the regression models ([286]https://github.com/limix/limix) ([287]92).

OCRs association with target genes (CREs)

   To determine the CREs of a gene (OCR-gene pairs), we follow two
   approaches: (i) For promoter OCRs, we directly annotate OCRs to genes
   based on TSS distance (<1 kb), and (ii) for intronic or distal
   intergenic OCRs, we explored the association between chromatin
   accessibility and gene expression. First, we identified all intronic or
   distal intergenic OCRs for each expressed gene within a window of ±1 Mb
   from the gene’s TSS. Second, the Spearman correlation was computed
   across all samples (59 samples; eight cell subpopulations) for each
   OCR-gene pair (fig. S2B). Bonferroni correction was used to adjust for
   multiple testing. OCR-gene pairs with an adjusted P value <0.05 were
   considered significant. A complete list of pairs (62,559) is available
   in OSF ([288]https://osf.io/gswpy/).

OCRs association with TFs (trans-regulatory elements)

   To annotate OCRs to putative TF binding motifs (TF-OCR pairs), we first
   identify the footprints within our OCRs. For each sample, we recognize
   the footprints using two algorithms, HINT ([289]93) and Wellington
   ([290]94), using as input the corresponding sample BAM file and BED
   file from the corresponding cell subpopulation. For the footprints
   identified by HINT, we filtered out those footprints with less than 20
   counts. Then, we get the union of the footprints identified for each
   sample through both strategies. Next, we compute the consensus
   footprint for each cell subpopulation following a similar strategy as
   for the OCR consensus. To get the consensus footprint for each cell
   subpopulation, we found the genome regions covered by at least n − 1 of
   the sets of regions (n being the total number of samples for a specific
   cell subpopulation). Last, to get the consensus footprint in the whole
   population, we get the regions in the genome where the coverage is at
   least in 1 of the consensus footprint from the different cell
   subpopulations, and then those with a width less than 6 bp were
   removed, resulting in a total of 380,922 ATAC-seq footprints. These
   processes were done using the rtracklayer (version 1.58.0) and
   GenomicRanges (version 1.50.2) r-packages ([291]83, [292]84). The
   identified footprints were located in 72,771 OCRs (69%).

   After identifying the footprints, we want to annotate the putative TF
   binding motif to each footprint. To do that, we used the Vierstra
   database ([293]95). First, we generate a reduced version of the
   Vierstra database, getting those annotations falling within our OCR
   regions using TABIX ([294]96). Then, we annotated at the archetype
   level (version 1.0) to reduce motif redundancy and overlap, resulting
   in 30,258,213 region-motif annotations and 286 motif archetypes. The
   overlap between footprints and motif archetypes was explored, defining
   a minimum overlap of 90% of archetype width. Last, footprints were
   linked to the original OCR to get the OCR-archetype matrix, resulting
   in a matrix 105,530 by 282.

   To translate archetypes to TFs, we define the following strategy.
   First, we compute the aggregated motif score (TFBS accessibility score)
   for each archetype using the chromVAR r-package (version 1.5.0)
   ([295]89) with the default parameters. Then, to identify the TFs most
   likely associated with an archetype, we compute, for each archetype,
   the Spearman correlation between the TFBS score of the archetype and
   the RNA expression of each TF clustered in that archetype. Correction
   for multiple comparisons was done using the Benjamini-Hochberg method.
   Moreover, the TFBS score variability and the coefficient of variation
   of TF gene expression were computed. We determine the TF-archetype
   association when (i) significant variation for both the TFBS score and
   TF gene expression and significant and positive correlation is observed
   (adjusted P value <0.05 and ρ > 0) and (ii) no considerable variation
   is observed for the TFBS score (<1Q TFBS variability) or TF gene
   expression (<1Q RNA coefficient of variation), independently of the
   correlation result, assuming that correlation will not be meaningful
   because of the low variability of the variables assessed. Last, we
   obtained a list of archetype-TF associations that allowed us to
   translate archetypes to TFs and get the OCR-TF matrix (105,530 by 275)
   with 60,881 OCRs linked at least to one TF motif.

Generating the GRNs and regulons of B cell differentiation

   On the basis of the different associations of OCR-gene pairs
   (cis-regulation; 62,559 links) and TF-OCR pairs (trans-regulation;
   2,299,328 links), we defined the GRN for early B cell differentiation
   (fig. S2B). To do that, from the defined OCR-gene and TF-OCR pairs, we
   first keep those interactions derived from the DEGs (differential
   expression at least in one transition), their promoter regions, and the
   associated enhancer elements with differential accessibility
   (differential chromatin accessibility at least in one transition) under
   the assumption that those genes and OCRs changing across cell
   subpopulations are going to be the ones governing the B cell
   differentiation process. Then, having the putative TF-OCR-gene
   interactions (n = 1,151,024), the correlation between TF-OCR and
   TF-gene was computed using Spearman correlation. To reduce spurious and
   meaningless TF-OCR-gene interactions, we excluded from the GRN those
   where (i) the TF showed no differential expression (n = 469,073); (ii)
   TF-gene association was not significantly correlated (Spearman P value
   >0.05; n = 186,837); (and iii) meaningless interactions including (a)
   positive correlation for TF-gene and TF-OCR but negative correlation
   for TF and gene, (b) negative correlation for TF-gene and TF-OCR but
   positive correlation for TF and gene, (c) positive correlation for
   TF-gene and TF-OCR but negative correlation for OCR-gene, (d) negative
   correlation for TF-gene, TF-OCR, and OCR-gene, and (e) positive
   correlation for TF-gene and OCR-gene but negative for TF-OCR
   (n = 117,846). Last, 377,268 TF-OCR-gene interactions were retained to
   reconstruct the GRN of the B cell differentiation process, involving
   169 TFs, 7310 genes, and 16,074 OCRs. By exploring the TF-gene
   associations, we defined specific gene sets for each TF, referred to as
   regulons (TF regulons).

   Specific GRNs for cell subpopulations (csGRN) were computed on the
   basis of the chromatin accessibility and gene expression for each cell
   subpopulation. For example, for HSCs, the GRN captures those
   TF-OCR-gene interactions where genes were highly expressed in that cell
   subpopulation according to gene expression patterns (fig. S6A; genes
   included in brown clusters), and a median of normalized chromatin
   accessibility ≥1 (defining accessible OCR) was observed.

Regulon cell type specificity and functionality analysis

   To evaluate the cell type specificity of each predicted regulon, we
   calculated the RSSs per cell type and regulon. The RSS algorithm is
   based on the JSD, a measure of the similarity between two probability
   distributions ([296]25, [297]97). Specifically, we used the binary
   enrichment of regulon activity (based on target regions or target
   genes) together with the cell assignments to see how specific each
   predicted regulon is for each cell type. To compute the binarized
   regulon activity, we used the GSVA score instead of the AUCell used in
   single-cell data. Then, a threshold of 0.2 was defined for binarizing
   the data, resulting in a matrix where, for each sample and regulon, we
   know if it is enriched (1) or not (0). On the basis of these binarized
   data and the cell type assignment for each sample, we calculated the
   RSS using the calculate_rss function from the scFunctions r-package
   (version 0.0.0.9000) ([298]https://github.com/FloWuenne/scFunctions/).
   The output was a data frame with an RSS score for each regulon-cell
   type combination. The RSS was computed twice for each regulon according
   to TF-gene behavior, a positive correlation between TF and gene pointed
   to an activator role and a negative correlation to a repressor role. A
   heatmap/dot plot shows the TF expression of the regulon on a color
   scale and cell type specificity (RSS) of the regulon on a size scale
   ([299]Fig. 2, A and B, and fig. S7A).

   To explore the biological functions of the defined regulons, we
   performed an ORA of the target genes of each regulon using the enricher
   function of the clusterProfiler r-package (version 3.18.1) ([300]78).
   The gene ratio and −log[10] from the adjusted P value under the
   Benjamini-Hochberg method were computed using GO or Kyoto Encyclopedia
   of Genes and Genomes databases (table S4). Top significant and highly
   relevant biological pathways were selected for the summary circular
   plot ([301]Fig. 2C) generated using the chordDiagram function from the
   circlize r-package (version 0.4.15) ([302]98). For the comparison of
   the pathway enrichment analysis for a list of gene sets, such as the
   target genes of a regulon in each cell state ([303]Fig. 3, E and F, and
   fig. S7, C and D), we used the compareCluster function from the
   clusterProfiler r-package (version 4.10.0) ([304]78).

Topological property and network distance estimation and visualization

   According to the structure of our GRNs, different network properties
   were computed using the igraph r-package (version 1.3.5) ([305]99).
   Node betweenness (the amount of influence a node has over the flow of
   information in a graph through the shortest path calculation between
   all pairs of nodes) and degree (the total number of edges adjacent to a
   node) were evaluated for the complete set of GRNs with weighted edges
   according to Spearman’s coefficient of correlation between TF and gene
   ([306]Fig. 2E). Moreover, Pearson correlation was computed to compare
   the networks between them, and multidimensional scaling was used for
   visualization ([307]Fig. 2D).

Single-cell samples (in-house dataset)

   A total of five human bone marrow samples was collected in our
   institution for the single-cell analysis after informed consent was
   obtained. Four samples were CD34+ human bone marrow cells, processed by
   a single-cell multiome, and one human bone marrow sample sorted for
   hematopoietic lineage (sorting strategy: CD34+ CD38− CD3− CD64− CD56-
   CD90+; fig. S1B) was divided to perform both single-cell RNA-seq
   (scRNA-seq) and single-cell ATAC-seq (scATAC-seq) analyses.

Single-cell multiome data generation (in-house dataset)

   The epigenomic landscape and gene expression of 10,000 individual
   nuclei from the bone marrow were examined using Single Cell Multiome
   ATAC + Gene Expression (10x Genomics) according to the manufacturer’s
   instructions ([308]CG000365 RevC and [309]CG000338 RevF). Briefly,
   100,000 CD34+ cells were lysed for 2 min, and then 16,100 nuclei were
   transposed in bulk. Next, nuclei were partitioned in gel
   bead-in-emulsions (GEMs) in the Chromium X instrument (10x Genomics),
   where each nucleus was captured separately and its transposed DNA and
   cDNA were uniquely indexed. Preamplified transposed DNA and cDNA were
   then used to generate separate ATAC-seq and RNA-seq dual-indexed
   libraries. Sequencing was performed in a NextSeq2000 (Illumina) at a
   depth of at least 20,000 reads per nucleus for gene expression (Read1:
   28; Read2: 90; i7 index: 10; i5 index: 10) and 25,000 reads per nucleus
   for chromatin profiling (Read1: 50; Read2: 49; i7 index: 8; i5 index:
   24).

scRNA-seq data generation (in-house dataset)

   The transcriptome of the hematopoietic cells from bone marrow was
   examined using NEXTGEM Single Cell 3’ Reagent Kits v2 (10x Genomics)
   according to the manufacturer’s instructions. Between 17,600 and 20,000
   cells were loaded at a concentration of 700 to 1000 cells/μl on a
   Chromium Controller instrument (10x Genomics) to generate single-cell
   GEMs. In this step, each cell was encapsulated with primers containing
   a fixed Illumina Read 1 sequence, a cell-identifying 16-nt 10x barcode,
   a 10-nt unique molecular identifier (UMI), and a poly-dT sequence. Upon
   cell lysis, reverse transcription yielded full-length, barcoded cDNA.
   This cDNA was then released from the GEMs, PCR amplified, and purified
   with magnetic beads (SPRIselect, Beckman Coulter). Enzymatic
   fragmentation and size selection were performed to optimize cDNA size
   before library construction. Fragmented cDNA was then end repaired,
   A-tailed, and ligated to Illumina adaptors. Final PCR amplification
   with barcoded primers allowed sample indexing. Library quality control
   and quantification were performed using a Qubit 3.0 Fluorometer (Life
   Technologies) and Agilent’s 4200 TapeStation System (Agilent),
   respectively. Sequencing was performed in a NextSeq500 (Illumina)
   (Read1: 26 cycles; Read2:55 cycles; i7 index: 8 cycles) at an average
   depth of 60,226 reads per cell.

scRNA-seq data analysis (in-house dataset)

   The bcl (binary base call) sequence file format obtained on the 10x
   Genomics Chromium sequencing platform was transformed into fastq format
   using mkfastq from Cell Ranger 7.1.0 (for scRNA-seq) ([310]71) or
   Cellranger-arc 2.0.2 (for multiome single-cell sequencing). The count
   matrix used for downstream analysis was generated using the count
   pipeline from Cell Ranger 7.1.0 (for scRNA-seq) or Cellranger-arc 2.0.2
   (for multiome single-cell sequencing) with the GRCh38 reference
   transcriptome provided by 10x Genomics. Following Seurat guidelines
   (version 4.3.0.1) ([311]100), several cell and feature quality
   filtering steps were applied to each sample. First, doublets were
   removed using scDblFinder (version 1.12.0) ([312]101), and then cells
   with >30% of mitochondrial percentage in single-cell multiome or >10%
   in scRNA-seq were filtered out. Also, cells with <5% of ribosomal
   percentage were filtered out in the scRNA-seq sample. Last, we filtered
   out extreme outlier cells or features on the basis of lower (0.05) to
   upper (0.95) bound quantiles for “number of features” and “counts per
   cell.”

   Normalization of the gene expression measurements for each cell was
   done by multiplying the total expression by a scale factor (10,000) and
   log-transforming it. Feature selection criteria have been used for
   downstream analysis based on most variable genes using 2000 top genes.
   Feature subspaces were obtained from most variable genes using PCA
   using 20 components. Clustering was computed using the Louvain
   algorithm over the principal components subspace with a resolution
   value of 0.5. For data visualization, uniform manifold approximation
   and projection (UMAP) was used, computing it over all PCA components
   obtained. Then, a first annotation based on several hematopoietic
   biomarkers for stem cells, lymphoid cells (T and B cells), and myeloid
   cells (erythroid, monocytic, and granulocytic cells) was done. Those
   cell clusters representing undifferentiated cells and lymphoid lineage
   within the analyzed CD34+ cells were kept for forward analysis. Next,
   to integrate the five single-cell transcriptional profiles, we merged
   the samples using Seurat. Merged samples were normalized on the basis
   of the gene expression measurements for each cell by multiplying the
   total expression by a scale factor (10,000) and log-transforming it.
   Regress-out of cell cycle was applied. Feature selection criteria were
   used for downstream analysis based on most variable genes using 2000
   top genes. Feature subspaces were obtained from most variable genes
   using PCA using 20 components. To remove the influence of the dataset
   of origin, we applied Harmony ([313]102). Clustering was computed using
   the Louvain algorithm over the principal components subspace with a
   resolution value of 0.8 based on the clustree approach (version 0.5.0)
   ([314]103). To annotate cell identity, we checked for B cell
   biomarkers, cell cycle phase, and AUCell score (AUCell version 1.20.2)
   ([315]18) derived from our bulk signatures (fig. S8). We also computed
   the AUCell score to assess the gene set activity of TF regulons at a
   single-cell resolution and visualized the overall profile with a
   heatmap representation scaled by rows ([316]Fig. 4D).

scATAC-seq data generation (in-house dataset)

   The scATAC experiments were performed with the Chromium Next GEM Single
   Cell ATAC kit version 1.1 from 10x Genomics following the
   manufacturer’s instructions. For samples with 100,000 cells or less,
   the nucleus isolation is done following the 10x Genomics–demonstrated
   protocol for low cell input. Briefly, cells are collected in PBS and
   0.04% BSA at 4°C; after that, cells are counted in a cell counter; and
   centrifugation is done to concentrate them in 50 μl of PBS and 0.04%
   BSA. After that, 45 μl of supernatant is removed, and 45 μl of chilled
   and freshly prepared lysis buffer is added. After 5 min of incubation
   on ice, 50 μl of wash buffer is added. After that, centrifugation is
   done to remove 95 μl of the supernatant without disrupting the pellet.
   Additional centrifugation is done with 45 μl of diluted nucleus buffer
   to remove the supernatant. Isolated nuclei are resuspended in 7 μl of
   chilled diluted nucleus buffer. Two of these microliters are used to
   determine the cell concentration using a cell counter. The other 5 μl
   is used immediately to generate a bulk transposition reaction following
   the manufacturer’s instructions and conditions. In the transposase
   reaction, the transposase enters the nuclei and preferentially
   fragments the DNA in open regions of the chromatin. In this step,
   adapter sequences are added to the ends of the DNA fragments. After
   transposition, GEMs are generated in the Chromium controller (10x
   Genomics). Upon GEM generation, the gel bead is dissolved.
   Oligonucleotides containing an Illumina P5 sequence, a 16-nt 10x
   barcode, and a Read1 sequence are released and mixed with DNA fragments
   and master mix. After incubation, GEMs are broken, and pooled fractions
   are recovered. P7, a sample index, and the Read2 sequence are added
   during library construction via PCR. Library quality control and
   quantification were performed using a Qubit 3.0 Fluorometer (Life
   Technologies), Agilent’s 4200 TapeStation System, and Bioanalyzer 2100
   (Agilent). Sequencing was performed using a NextSeq500 (Illumina)
   (Read1: 50 cycles; i7 index: 8 cycles; i5 index: 16 cycles; Read2: 50
   cycles) to an average depth of 14,900 fragments per cell.

scATAC-seq data analysis (in-house dataset)

   The bcl file format obtained on the 10x Genomics Chromium sequencing
   platform was, in this case, transformed into fastq format using mkfastq
   from Cell Ranger ATAC 1.1.0 (for scRNA-seq) ([317]104) or
   Cellranger-arc 2.0.2 (for multiome single-cell sequencing). The count
   matrix used for downstream analysis was generated similarly using the
   count pipeline from Cell Ranger ATAC 1.1.0 (for scATAC-seq) or
   Cellranger-arc 2.0.2 (for multiome single-cell sequencing) with the
   GRCh38 reference transcriptome provided by 10X Genomics.

   To make bulk ATAC-seq and scATAC-seq comparable, a requantified version
   of the scATAC-seq original count matrix was conducted to generate a new
   feature × cell raw count matrix with the same OCRs from bulk as the
   feature space using the FeatureMatrix function from the Signac package
   ([318]105). The new scATAC-seq matrix contains recalculated count
   values based on scATAC-seq genomic fragments according to the bulk
   ATAC-seq feature space. Then, we implemented quality control criteria,
   retaining cells with a FRIP (fraction of reads in peaks) value of at
   least 50% and a TSS enrichment score above 5. Cells with a
   log-transformed number of unique fragments per barcode below 3.8 were
   filtered out. Furthermore, doublet cells were identified and excluded
   using the Scrublet algorithm with default parameters ([319]106).
   Dimensionality reduction was conducted using latent dirichlet
   allocation by performing Gibbs sampling for various topic counts (K).
   The optimal K value was selected on the basis of the highest consensus
   scores from four pycisTopic proposed metrics ([320]107).

   Accessibility imputation was carried out using the impute_accessibility
   function from the pycisTopic package ([321]17), with a scaling factor
   of
   [MATH: <mrow><msup><mn>10</mn><mn>6</mn></msup></mrow> :MATH]
   to normalize accessibility scores across selected cells and regions.
   Subsequently, gene activity scores were computed using the
   get_gene_activity function from the pycisTopic package, incorporating
   region-cell probabilities, distance weights, and upstream/downstream
   search spaces ranging from 1 to 100 kb to refine gene activity
   assignments. Default values of distance weights and other parameters,
   such as gene boundaries and TSS extensions, were used to optimize the
   accuracy of the activity predictions.

   Label transfer was executed in two phases. First, annotations derived
   from scRNA-seq were transferred to the multiome scATAC-seq (activity
   scores) samples using annotation transfer methods including Harmony,
   BBKNN, Scanorama, and canonical correlation analysis. The accuracy of
   these transfers was evaluated against the annotations of the
   corresponding cell labels in the scRNA-seq, with Harmony yielding the
   highest accuracy across four multiome samples. In the second phase,
   this transfer method was used to annotate the multiome samples and the
   unpaired scATAC-seq sample. Last, samples were integrated using the
   Harmony method ([322]102), facilitating the generation of a common
   subspace for visualization.

scATAC-seq regulon analysis (in-house dataset)

   To explore enhancer GRNs (eGRNs) from single-cell multiomic data, we
   integrated the previously identified scATAC-seq data with guidelines
   suggested by the SCENIC+ workflow ([323]17). In line with SCENIC+
   recommendations, we identified candidate enhancer regions by binarizing
   region-topic probabilities and analyzing DARs. The binarization process
   used Otsu’s method, followed by dropout imputation by multiplying
   cell-topic and topic-region distributions, as previously established by
   cisTopic ([324]107). Normalization of the data was conducted using a
   scale factor of 10^4, after which highly variable peaks were
   identified. DARs were computed using the Wilcoxon rank-sum test,
   maintaining the same ordered labeling of cell types that was
   established in the prior scRNA-seq analysis.

   To generate eGRNs, we merged scRNA-seq and scATAC-seq data to create a
   pseudomultiome dataset. For each cell type annotation, we randomly
   selected cells from both datasets. Meta-cells were constructed by
   averaging scRNA-seq and scATAC-seq counts for each annotation with
   K = 5 cells. Motif enrichment analysis was then performed using the
   SCENIC+ pipeline, integrating binarized topics and DARs from
   pycisTopic. We applied the cisTarget and DEM (differential enrichment
   of motifs) algorithms for motif discovery.

   cisTarget identified genomic regions from the SCREEN database (HG38)
   using the Cluster-Buster tool ([325]108) and ranked motifs by motif
   scores. A minimum overlap of 40% between input regions and reference
   motifs was required. Recovery curve-based scoring and normalized
   enrichment scores (NESs) were computed for each motif, retaining only
   motifs with an NES >3. DEM performed a Wilcoxon test to assess motif
   enrichment between foreground and background regions. Motifs with
   adjusted P values <0.05 (Bonferroni corrected) and a log[2] fold change
   >0.5 were selected.

   Cistromes were generated by merging motif-based regions associated with
   TFs, providing a comprehensive map of TF-target regulatory
   interactions. Last, region-to-gene distances were computed, and
   TF-target importances were evaluated using the gradient boosting
   machine method, adhering to SCENIC+ guidelines.

scRNA-seq data analysis (public dataset, phs002371.v7.2)

   We downloaded the FASTQ files for the phs002371.v7.p2 study from Human
   Tumor Atlas Network (HTAN; phs002371.v7.p2) ([326]39). To generate the
   count matrices needed in the downstream analysis, we used the count
   pipeline from Cell Ranger 7.1. with the GRCh38 reference transcriptome
   provided by 10x Genomics. Following Seurat guidelines (version 5.0.1)
   ([327]109), several cell and feature quality filtering steps were
   applied to each sample. First, doublets were removed using scDblFinder
   (version 1.12.0) ([328]101), and then following the authors’
   preprocess, cells with >15% of mitochondrial percentage, cells with
   fewer than 1500 UMIs or greater than 50,000 UMIs, and cells with
   hemoglobin detection (>10 counts of hemoglobin-related genes) were
   filtered out. Healthy samples (n = 15 samples from five donors) were
   merged, and normalization of the gene expression measurements for each
   cell was done by multiplying the total expression by a scale factor
   (10,000) and log-transforming it. Feature selection criteria have been
   used for downstream analysis based on most variable genes using 1500
   top genes. Regress-out of counts and mitochondrial percentage was
   applied. Feature subspaces were obtained from most variable genes using
   PCA using 20 components. Clustering was computed using the Louvain
   algorithm over the principal components subspace with a resolution
   value of 0.5. For data visualization, UMAP was used, computing it over
   all PCA components obtained. Then, a first annotation based on several
   hematopoietic biomarkers for stem cells, lymphoid cells (T and B
   cells), and myeloid cells (erythroid, monocytic, and granulocytic
   cells) was done. Those cell clusters representing undifferentiated
   cells and lymphoid lineage within the analyzed cells were kept for
   forward analysis. Next, to integrate the 15 single-cell transcriptional
   profiles of the subset of cells of interest, the merged samples were
   normalized on the basis of the gene expression measurements for each
   cell by multiplying the total expression by a scale factor (10,000) and
   log-transforming it. Regress-out of cell cycle was applied. Feature
   selection criteria were used for downstream analysis based on most
   variable genes using 2000 top genes. Feature subspaces were obtained
   from most variable genes using PCA using 20 components. To remove the
   influence of the dataset of origin, we applied Harmony ([329]102).
   Clustering was computed using the Louvain algorithm over the principal
   components subspace with a resolution value of 0.8 based on the
   clustree approach (version 0.5.0) ([330]103). To annotate cell
   identity, we checked for cell biomarkers, cell cycle, and AUCell score
   (AUCell version 1.20.2) ([331]18) derived from our bulk signatures
   (fig. S8).

Public RNA-seq data from patients with BCP ALL (Li dataset)

   Transcriptomic data from patients with BCP ALL (leukemic cells from
   bone marrow aspirates) were obtained from the European Genome-phenome
   Archive. The RNA-seq data (BAM files) from EGAD00001002151,
   EGAD00001002692, and EGAD00001002704 datasets were downloaded after the
   corresponding access request. Data from adult and pediatric individuals
   belonging to a well-known molecular subgroup ([332]40), including MEF2D
   fusions, TCF3-PBX1, ETV6-RUNX1/like, DUX4 fusions, ZNF384 fusions,
   BCR-ABL1/Ph–like, hyperdiploidy, and KMT2A fusions were included in the
   analysis, resulting in 469 patients ([333]Fig. 5A).

   As BAM files were mapped to GRCh37, we first obtained the Fastq files
   using the bamtofastq function from bedtools (version 2.25.0) ([334]86).
   Next, the read quality was assessed with FastQC software (version
   0.11.7) (Babraham Institute). Low-quality, short-read, and TruSeq3-SE
   adapters were trimmed using Trimmomatic (version 2.6.0a) ([335]73)
   under the following parameters: ILLUMINACLIP:TruSeq3-SE:2:30:10
   LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75. Reads were then
   mapped to the December 2013 H. sapiens high-coverage assembly GRCh38
   from the Genome Reference Consortium GRCh38 (hg38) using STAR (version
   2.6.0a) ([336]74) and counted for each gene using htseq-count ([337]75)
   with the –m intersection-nonempty option and a GTF file from GRCh38
   Ensembl annotation release 96, getting the information for a total of
   58,884 genes. Genes with a minimum read count of 5 in all samples were
   retained (32,416 genes). Next, batch correction was done using the
   ComBat_seq function from the sva r-package (version 3.38.0) ([338]91,
   [339]110). Counts were normalized under TMM normalization and voom
   transformation from the edgeR r-package (version 4.0.2) ([340]76). PCA
   was used for data exploration (fig. S10, A to C). Differential
   expression analysis of leukemic subtypes was performed using limma.
   Genes with an adjusted P value ≤0.01 (Benjamini-Hochberg) and fold
   change ≥1.5 in at least six contrasts were defined as significant to
   get the B-ALL subtypes biomarkers. The B-ALL subtype biomarkers list
   can be found in table S12.

ORA in public BCP ALL data (Li dataset)

   The ORA of the gene sets from B-ALL subtype signatures (fig. S9, D and
   E, and table S12) was performed in the signatures of the different B
   cell populations (fig. S4) using the enricher function of the
   clusterProfiler r-package (version 3.18.1) ([341]Fig. 5C) ([342]78).
   The gene ratio and −log[10] from the adjusted P value under the
   Benjamini-Hochberg method were computed. The ORA was also applied to
   explore the enrichment of 711 binding sites defined as the
   highest-affinity binding sites for the ETV6-RUNX1 fusion under ChIP-seq
   analysis ([343]41) in the consensus OCRs for each cell subpopulation
   ([344]Fig. 5D). Last, we applied the ORA to explore whether there was a
   significant enrichment of the regulons identified along B cell
   differentiation in the different B-ALL subtype signatures derived from
   the Li dataset ([345]Fig. 5E).

Public transcriptomic data from patients with BCP ALL

   Transcriptomic data from patients with BCP ALL (leukemic cells from
   bone marrow aspirates) were obtained from different public datasets to
   validate the results found under the Li dataset approach. Four
   different datasets were obtained from Gene Expression Omnibus (GEO):
   Rainer dataset ([346]GSE73578) ([347]42), microarray-based whole-genome
   expression profiling of lymphoblasts from 46 patients; Iacobucci
   dataset ([348]GSE13204) ([349]43), microarray gene expression profiling
   of 144 adult patients with B-ALL and who are B-ALL negative for known
   molecular rearrangements; Singh dataset ([350]GSE153670) ([351]68),
   RNA-seq data from bone marrow samples taken from patients enrolled on
   ENSIGN or ELIANA clinical trials before infusion of CART19 (n = 27);
   Zinngrebe dataset ([352]GSE140556) ([353]69), microarray data from
   eight SMAC mimetic monotherapy (SM)–sensitive and six SM–insensitive
   BCP ALL samples. Microarray processed data downloaded from GEO were
   used for the computation of GSVA score for each dataset. In the case of
   the Singh dataset, the downloaded raw counts from RNA-seq were
   filtered, defining a minimum read count of 5 in all samples per gene
   and normalized under TMM normalization and voom transformation from the
   edgeR r-package (version 4.0.2) ([354]76). PCA was used for data
   exploration (fig. S10, F to I).

Public ChIP-seq histone marks

   ChIP-seq data from the ENCODE project ([355]www.encodeproject.org/) for
   H3K4me1, H3K4me3, and H3K27ac on B cells, naïve B cells, and common
   CD34+ myeloid progenitors were downloaded (table S1). To get the
   consensus regions for each cell subpopulation and histone mark, we
   found the genome regions that are covered by at least n − 1 of the sets
   of peaks (n being the total number of samples for a specific cell
   subpopulation), and then we merged the nearby peaks with a minimum gap
   width of 31. To get the consensus regions for a histone mark in the
   whole population, we get the regions in the genome where the coverage
   is at least in 1 of the consensus histone marks from the different cell
   subpopulations and then merged the nearby peaks with a minimum gap
   width of 31, resulting in 166,295 H3K4me1 regions, 37,952 H3K4me3
   regions, and 159,079 H3K27ac regions. These processes were done using
   the rtracklayer (version 1.58.0) and GenomicRanges (version 1.50.2)
   r-packages ([356]83, [357]84). Then, the overlap between our OCRs and
   histone marks was assessed using the findOverlaps function
   (minoverlap = 6) of GenomicRanges (fig. S13A).

Public Hi-C data

   Hi-C public data of naïve B cells were obtained from EGAD00001006486
   ([358]51). The valid reads of Hi-C experiments were processed with
   TADbit version 1.1 ([359]111). Hi-C data were normalized using the oned
   function of the dryhic r-package (version 0.0.0.9100) ([360]112) at a
   100-kb resolution to remove known experimental biases, a method that
   controls for the presence of abnormal karyotypes in cancer samples.
   Topologically associating domains were detected at a 40-kbp resolution
   using the TopDom algorithm ([361]113). The significant Hi-C
   interactions were called with the Fit-Hi-C tool (version 1.1.3)
   ([362]114), binned at a 10-kb resolution and with the default
   thresholds: P value <0.001 and q-value <0.001.

   To study the compartmentalization of the genome in naïve B cells, we
   used merged biological replicates, resulting in interaction Hi-C maps
   with around 300 million valid reads each. Briefly, normalized
   chromosome-wide interaction matrices at a 100-kb resolution were
   transformed into Pearson correlation matrices. These correlation
   matrices were then used for PCA, for which the first eigenvector (EV)
   normally delineates genome segregation. The multimodal distribution of
   the EV coefficients from the B cell dataset was modeled as a Gaussian
   mixture with three components (k = 3). To estimate the mixture
   distribution parameters, an expectation-maximization algorithm using
   the normalmixEM function from the mixtools r-package (version 1.2.0)
   was applied ([363]115). A Bayesian information criterion was computed
   for the specified mixture models of clusters (from 1 to 10) using the
   mclustBIC function from the mclust r-package (version 5.4.9) ([364]77).
   Three underlying structures were defined: alternative
   compartmentalization into A-type (with the most positive EV values),
   B-type (with the most negative EV values), and I-type (an
   intermediate-valued region with a different distribution) compartments.
   Two intersection values were defined at the intersection points between
   two compartments and used as standard thresholds to categorize the data
   into the three different compartments; that is, the A-type compartment
   was defined for EV values >=+0.1388, the I-type compartment was defined
   for EV values between +0.1388 and −0.01955, and the B-type compartment
   was defined for EV values between <=−0.01955.

   Then, we checked the overlap between our OCRs and the compartments
   identified using the findOverlaps function (default parameters) of the
   GenomicRanges r-package (version 1.50.2) (fig. S13B) ([365]84). When an
   OCR overlaps with two compartments, we assigned it to the compartment
   with a higher overlap.

Public promoter capture Hi-C interaction data

   Detected interactions using the promoter capture Hi-C technology from
   Javierre et al. ([366]52) were downloaded. From the promoter capture
   Hi-C peak matrix, we select all those interactions with a CHiCAGO score
   ≥5 in the naive B cell population, resulting in 197,442 interactions.
   The coordinates of the interacting regions were translated from GRCh37
   to GRCh38 using the liftOver r-package (version 1.18.0). Then, we check
   the overlap between our interactions (OCR-gene associations) and their
   interactions using the findOverlaps function (minoverlap = 6) of the
   GenomicRanges r-package (version 1.50.2) (fig. S13C showing some of the
   overlapping interactions) ([367]84).

Web application

   The Shiny r-package (version 1.8.0) ([368]116) was used to generate a
   user-friendly web interface for browsing called B-rex
   ([369]https://translationalbio.shinyapps.io/brex/). For scRNA-seq, the
   ShinyCell template (version 2.1.0) was used ([370]117). The Shiny app
   is hosted on [371]https://www.shinyapps.io/.

Acknowledgments