Abstract
Dysregulation of early B cell lymphopoiesis—the process guiding
cellular immunity development—can lead to malignancy, making it crucial
to understand its regulatory mechanisms. We generated a multiomics
resource comprising paired chromatin accessibility and gene expression
profiles across eight human B cell precursor populations, providing a
detailed characterization of early human B cell development.
Integrative analysis revealed highly cell type–specific regulatory
elements and enabled the reconstruction of the gene regulatory network
governing differentiation. We identified putative candidate regulons,
such as ELK3, enriched in pro–B cells and potentially involved in cell
cycle progression. Regulons from bulk data were projected onto
single-cell data, validating their activity and refining the regulatory
landscape. This resource enabled identification of active regulatory
programs and transformation-associated states in B cell acute
lymphoblastic leukemia. The publicly available atlas provides a
valuable resource for understanding B cell development and disease,
supporting future efforts to decode regulatory programs in immunity and
hematologic malignancies.
__________________________________________________________________
Regulatory landscape of human B cell lymphopoiesis provides insights
into the pathogenesis of B cell acute lymphoblastic leukemia.
INTRODUCTION
B cell lymphopoiesis gives rise to a diverse repertoire of peripheral B
cells, antibody-producing cells capable of mediating protection against
pathogens while remaining tolerant of self-antigens. Before migration
to the periphery, B cell development occurs in the bone marrow
following a complex process governed by the sequential rearrangement of
the immunoglobulin (Ig) heavy and light chain genes and their
expression and subsequent assembly of their encoded proteins into the B
cell receptor (BCR) ([72]1, [73]2). This highly ordered and controlled
process is divided into several cell stages or B cell subpopulations
([74]3, [75]4). The transcriptional profile of each cell subpopulation
emerges from an underlying gene regulatory network (GRN) in which a
limited number of transcription factors (TFs) and cofactors regulate
each other and their downstream target genes ([76]5). Several TFs
orchestrating the B cell development have been described ([77]2,
[78]6). Key lineage TFs, such as IKZF1 (Ikaros family zinc finger 1)
and PU.1, are implicated in the initial commitment toward the lymphoid
lineage ([79]7, [80]8). PAX5 (paired box 5), EBF1 (early B cell factor
1), and E2A [TCF3 (transcription factor 3)] are regulators of B cell
lineage commitment ([81]9–[82]11), while IRF4 (interferon regulatory
factor 4), IRF8, and FOXO1 (forkhead box O1) are generally associated
with Ig rearrangement ([83]12, [84]13). Last, OCT2 [POU2F2 (POU class 2
homeobox 2)] and OBF1 (OCT-binding factor 1) are crucial TFs that drive
the migration of immature B cells to the periphery ([85]14, [86]15).
Most of this regulatory knowledge is derived from studies in mice, and
thus, it is unclear whether the mechanisms identified follow the same
regulatory programs as those of humans ([87]16).
The integrated analysis of paired chromatin accessibility and gene
expression profiling across various cell types offers a powerful
approach to reconstructing GRNs and identifying cell type–specific
regulons (TFs and their associated set of targets) ([88]17–[89]19), a
crucial approach to deciphering the regulation programs in humans. Such
an analysis involves predicting cis-regulatory elements (CREs),
identifying potential TFs, and linking the CREs to their candidate
target genes. This strategy was followed to provide initial insights
into the cis-regulation of mouse hematopoiesis ([90]19), capturing the
key TFs committing the B cell lineage, and in the study of
mixed-phenotype acute leukemia, revealing RUNX1 (runt-related
transcription factor 1) as a putative regulator ([91]20).
More recent advances in multiomics single-cell approaches have allowed
GRN inference, applied to study human peripheral blood mononuclear
cells ([92]17) and plasma cell differentiation in the human tonsil
([93]21). Furthermore, a detailed characterization of B cell
subpopulations in mice and human lymphopoiesis has been obtained using
single-cell technology approaches ([94]3, [95]4, [96]22, [97]23).
Despite notable progress, the understanding of the regulatory
mechanisms of this process remains incomplete, with current single-cell
multiomics experiments providing insufficient depth (detailed
profiling) and breadth (enough individuals) to conduct detailed
epigenetic profiling of the early stages of human B cell
differentiation.
To address this gap and to decipher the GRN of early B cell
differentiation in humans, we generated comprehensive chromatin
accessibility and transcriptional profiles corresponding to the
stepwise progression of B cell precursors (BCPs). Our study included
eight cell subpopulations across 13 healthy donors. A detailed joint
analysis of paired high-throughput data [bulk RNA sequencing (RNA-seq)
and assay for transposase-accessible chromatin using sequencing
(ATAC-seq)] allowed the generation of a comprehensive GRN of B cell
differentiation, capturing cell type specificity and dynamism of
regulons, also recovering the master regulators of this process. Our
resource not only recapitulated well-known regulatory elements but also
revealed candidate regulators of early B cell differentiation, such as
ELK3, which appears to be associated with pro–B cell cycle progression.
Next, we further explored the inferred regulatory programs at a
single-cell resolution, which confirmed the observed regulatory
dynamics and enabled a finer-grained definition of regulon-based early
B cell differentiation. Last, we leveraged this detailed
characterization of normal B cell development to deconvolve B cell
malignancies, revealing transformation-associated cellular states and
highlighting subtype-specific regulatory programs in certain forms of B
cell acute lymphoblastic leukemia (B-ALL). This detailed and extended
portrait of regulatory elements has been made publicly available in a
user-friendly web interface for browsing (B-rex;
[98]https://translationalbio.shinyapps.io/brex/), representing a rich
playground to further unravel the regulatory mechanisms of B cell
differentiation in health and disease.
RESULTS
Profiling of early human B cell differentiation stages by RNA-seq and
ATAC-seq
To investigate early human B cell differentiation, we harvested bone
marrow and peripheral blood from the same healthy individuals [n = 13;
mean age, 21.4 (20 to 23)]. Using fluorescence-activated cell sorting
(FACS), we isolated hematopoietic stem progenitor cells (HSPCs), common
lymphoid progenitors (CLPs), pro–B cells, pre–B cells, and immature B
cells from each bone marrow sample and transitional B cells and naïve B
cells (CD5+ and CD5−) from peripheral blood specimens ([99]Fig. 1A and
fig. S1A). All these populations were characterized using paired
transcriptomic (RNA-seq) and chromatin accessibility (ATAC-seq) assays
(fig. S2A and table S1).
Fig. 1. Characterization of early B cell differentiation through chromatin
accessibility and gene expression profiling.
[100]Fig. 1.
[101]Open in a new tab
(A) Major BCP subpopulations from the human bone marrow and peripheral
blood. The marker strategy followed to isolate eight B cell
subpopulations from 13 healthy individuals is shown. Markers strongly
expressed by the given B cell subpopulation are depicted in black,
whereas a lack of expression is shown in gray. (B) PCA of gene
expression (79 samples), (C) chromatin accessibility (78 samples), and
(D) TFBS activity score (78 samples). Each dot represents a sample
colored by cell subpopulation. (E) Heatmap representation of TFBS
activity score for key B cell regulators. Samples are shown by columns
(n = 78) and grouped by cell subpopulation, and TFs are shown by rows
(n = 12). (F) Representation of top 10 enriched GO terms derived from
the transcriptional changes over B cell transitions. The NES is shown
in a red-blue color range for each term and the −log[10] of the
adjusted P value as dot size. Biological functions are grouped into
four main terms. (G) Heatmap representation of the gene expression
profile for 58 well-known genes associated with B cell lymphopoiesis.
Samples are shown by columns (n = 79; grouped by cell subpopulation),
and genes are shown by rows (n = 58; grouped by biological processes).
Normalized gene expression levels are represented. (H) Heatmap
representation of the gene set activity score (GSVA) computed for each
human B cell subpopulation signature (n = 8; in rows) in each mouse
sample (n = 17; in columns). The GSVA score represented in the heatmap
is scaled by column to highlight the human B cell subpopulation with
higher enrichment in each mouse B cell subpopulation (in dark blue).
Principal components analysis (PCA) of both chromatin accessibility and
gene expression data revealed consistent and well-defined agreement of
cell types across individuals, effectively recapitulating the dynamic
process of B cell differentiation ([102]Fig. 1, B and C). To dissect
the TF-mediated regulatory mechanisms underlying this process, we first
inferred the TF activity by identifying TF binding at accessible or
open chromatin regions (OCRs). On the basis of these TF binding sites,
we estimated the TF activity for each sample (TFBS activity score; see
Materials and Methods), enabling the identification of highly active
regulators within distinct cell subpopulations ([103]Fig. 1, D and E).
The resulting TF activity profile showed clear cell type–specific
clustering and recapitulated known regulatory circuits associated with
B cell differentiation ([104]6). Among the transcriptional programs
identified, pro–B cells and pre–B cells displayed the most distinct
programs, underscoring key developmental transitions such as Ig
rearrangement mediated by FOXO1 and IRF4/8, with further B cell
commitment driven by EBF1 and PAX5. In HSPCs, stemness-associated
TFs—GATA2, TAL1, and MEIS1—were highly active, while CLPs showed an
increased activity of lymphoid-specifying TFs such as TCF3 and Ikaros
(IKZF1). At later stages of differentiation, TFs involved in B cell
maturation, including POU2F2 and SPIB, became prominent ([105]Fig. 1E
and fig. S3 for the TFBS profile of all TFs).
To further explore the molecular processes captured by our dataset, we
examined biological pathways enriched in transcriptional changes over B
cell transitions ([106]Fig. 1F and table S2), as well as the
transcriptional and chromatin accessibility profiles of well-known
markers of B cell differentiation ([107]Fig. 1G and
[108]https://genome.ucsc.edu/s/TransBioLab/hg38_Brex for the chromatin
accessibility profile). We observed a significant enrichment in
proliferation and differentiation pathways for CLPs that were also
preserved in pro–B cells and pre–B cells ([109]Fig. 1F). These
pathways, in the early stages of B cell differentiation, are mainly
regulated by the interleukin-7 receptor signaling, highly expressed in
CLPs, pro–B cells, and pre–B cells, to later switch to the antagonistic
signaling by the pre–B cell receptor (pre-BCR), expressed in pre–B
cells ([110]Fig. 1G) ([111]2, [112]4). For the transition from pro–B
cells to pre–B cells, we observed an even higher enrichment of
proliferation pathways ([113]Fig. 1F), associated with pre–B cell
expansion (large pre-B or FrC′) controlled by the convergence of
interleukin-7 receptor and the pre-BCR signaling ([114]24). We also
identified enrichment for Ig heavy-chain and light-chain
rearrangements, fundamental processes that organize B lymphocyte
development. Here, we observed an enrichment of Ig-related pathways in
the transition from CLPs to pro–B cells, a state when Ig rearrangement
commences ([115]Fig. 1F), and observed increased gene expression for
RAG1 and RAG2 in pro–B cells and pre–B cells ([116]Fig. 1G). The
successful Ig rearrangements lead to BCR expression in immature B cells
and their migration to peripheral blood, which is associated with a
reduction in proliferation and an increase in migration pathways
identified during the transition from pre–B cells to immature B cells
([117]Fig. 1F).
With a wealth of knowledge on B cell differentiation derived from
studies described in mice, we next examined the conservation and
translatability of our human B cell transcriptional signatures (fig.
S4) to those equivalent in mice (fig. S5) ([118]19). To achieve this,
we performed gene set variational analysis (GSVA) to calculate gene set
activity scores for each human B cell subpopulation signature across
individual mouse samples. These unitless scores, where higher values
indicate greater gene set activity, enabled robust cross-species
transcriptional comparisons. We observed a general concordance between
species ([119]Fig. 1H), further supporting the evolutionary
conservation of transcriptional programs across human and mouse B cell
differentiation and development. In summary, this comprehensive view of
human B cell differentiation demonstrated well-preserved programs
across individuals, aligning closely with those described in mice and
highlighting the highly organized processes subjacent to functional B
cell generation.
Defining the GRN and TF regulons driving early B cell differentiation
We next combined chromatin accessibility and gene expression data to
derive the underlying GRNs driving early B cell differentiation in
healthy humans. We defined a computational workflow to decipher the GRN
from the RNA-seq and ATAC-seq coordination (fig. S2B). Briefly, we
first inferred the CREs (62,559 OCR-gene pairs) and trans-regulatory
elements (2,299,328 TF-OCR pairs derived from a footprinting analysis).
Next, we refined the resulting TF-OCR-gene links (1,823,177 links) to
focus on the regulatory programs driving B cell differentiation by (i)
retaining only those interactions involving differentially expressed
genes (DEGs), promoter OCRs, and differentially accessible enhancer
OCRs across cell state transitions—under the premise that dynamic
regulatory elements are central to lineage specification—and (ii)
excluding interactions in which the TF was not differentially expressed
(n = 469,073), correlations between TF gene expression and target genes
were not statistically significant (P > 0.05; n = 186,837), or logical
inconsistencies in correlation directionality were observed
(n = 117,846). This process resulted in a high-confidence GRN with
377,268 TF-OCR-gene interactions, involving 169 TFs, 7310 target genes,
and 16,074 OCRs (table S3 and fig. S6, A to C). Further details are
provided in Materials and Methods.
After reconstructing the global GRN architecture, we defined regulons
as sets of genes associated with a specific TF, inferred through
TF-OCR-gene interactions. Each regulon represents a functional unit
within the GRN, consisting of a TF and its co-regulated target genes
and thereby constituting a coherent regulatory program operating within
a specific cellular context. The regulon’s target genes were divided
into two groups on the basis of their correlation with TF activity:
positively correlated targets, likely activated by the TF, and
negatively correlated targets, likely repressed by the TF. To explore
the cell type specificity of these regulons, we computed the regulon
specificity score (RSS) on the basis of the Jensen-Shannon divergence
(JSD) for both predicted target regions and target genes ([120]17,
[121]25). Intuitively, the JSD measures the difference between two
probability distributions, in this context, how the accessibility or
gene expression profile of a regulon’s targets in one cell type
diverges from the average profile across all cell types. As a result, a
high RSS indicates that the regulon’s activity profile is highly
distinct and concentrated in a particular cell type, reflecting strong
regulatory specificity.
This analysis revealed the high cell type specificity of TF regulons
for both activated and repressed target genes ([122]Fig. 2, A and B,
and fig. S7A). Specifically, among the regulons with activating
functions, we identified five main clusters according to the TF gene
expression and RSS. Several regulons were associated with HSPCs,
including TAL1, MEIS1, MECOM, HLF, and GATA2 ([123]Fig. 2A, cyan
cluster; termed “HSPC regulons”). Others showed enrichment throughout
differentiation up to the pre–B cell stage, such as GFI1, RUNX1, and
MYB ([124]Fig. 2A, blue cluster; termed “early B cell regulons”). A
smaller group characterized pro–B cells, including ELK3 and BCL6B (B
cell CLL/lymphoma 6B) ([125]Fig. 2A, green cluster; termed “pro–B cell
regulons”), while well-known factors such as SPI1, IRF4, TCF3, MEF2D
(myocyte enhancer factor 2D), and EBF1 were mainly enriched in pre–B
cells and immature B cells ([126]Fig. 2A, pink cluster; termed “pre–B
cell/immature B cell regulons”). Several regulons associated with late
differentiation stages, including PAX5, CEBPB, SPIB, etc. ([127]Fig.
2A, black cluster; termed “late B cell regulons”). In addition, 51 TF
regulons exhibited repressive activity, including PAX5, EBF1, and TCF3,
according to previous knowledge ([128]Fig. 2B) ([129]26). To further
understand the biological and molecular functions of the regulons, we
performed a pathway enrichment analysis on their target genes
([130]Fig. 2C and table S4). We observed a significant enrichment of
pathways related to B lymphopoiesis and cell cycle across all regulons.
Early-stage regulons (HSPC and early B cell) showed a significant
enrichment of pathways involved in cell survival, proliferation,
differentiation, and migration, with proliferation and survival
pathways particularly prominent in HSPC regulons. Pre–B cell/immature B
cell regulons were enriched in pathways related to cellular growth,
consistent with the generation of large pre–B cells, while late B cell
regulons were enriched for bioenergetic pathways. Together, our
multiomics strategy enabled the determination of the GRN and associated
regulons governing the entire B cell differentiation process,
recovering a large set of cell type–specific regulons broadly organized
into five major clusters.
Fig. 2. Regulatory landscape of TF regulons orchestrating early B cell
differentiation.
[131]Fig. 2.
[132]Open in a new tab
(A and B) Heatmap/dot plot showing, for each TF regulon with an
activator or repressor profile, respectively, the TF gene expression on
a color scale and the RSS of target regions (OCRs) on a size scale. A
high RSS score (high dot size) and high TF gene expression (color prone
to red) indicate that the regulon is highly specific to the
corresponding cell type. TF regulons are shown by rows, and B cell
subpopulations are shown by columns. TF regulons are clustered into
five main groups colored as cyan, blue, green, pink, and black
(right-side colored bar). (C) Circus plot summarizing the main
biological pathways enriched for each regulon. On the bottom part, TF
regulons are shown. TF regulons are clustered according to groups
identified in (A) (right-side colored bar). On the top part, the
enriched biological pathways are shown, grouped into five blocks
describing bioenergetic processes (in pink); cell growth (in purple);
hematopoietic and B cell lineage and cell cycle (in yellow);
differentiation, proliferation, survival, and migration (in blue); and
proliferation and survival (in turquoise). (D) Multidimensional scaling
representation from the Pearson correlation of csGRNs. Color code by
cell subpopulation. (E) Heatmap representation of degree [left; total
number of edges (connections) that a TF has associated] and betweenness
(right; influence of a given TF over the flow of information in the
network) centrality measures computed within each csGRN. TFs with the
highest degree are shown in yellow, and those with the highest
betweenness are shown in cyan. Each column corresponds to a csGRN, and
each row corresponds to a TF regulon. TF regulons are clustered
according to patterns identified in (A) (right-side colored bar). The
top 10 TF regulons of each csGRN based on degree and betweenness are
highlighted in red.
Uncovering the master regulators of B cell differentiation through network
topology
On the basis of the defined regulons, we next performed a network
topology analysis to provide insights into the master regulators.
Intuitively, network topology analyses are based on the structural
organization of the GRN, highlighting key regulators on the basis of
their position and connectivity within the network ([133]27, [134]28).
Our topological analysis was based on two measures of network
centrality: (i) degree, which denotes the total number of edges
(connections) that a TF has associated, and (ii) betweenness, which
quantifies the influence of a given TF over the flow of information in
the network (table S5; see Materials and Methods). As expected, when we
explored the top 10 ranked TFs by degree, we obtained several TFs
involved in basic cellular survival and proliferation [several E26
transformation–specific (ETS) and Krüppel-like factor (KLF) family
([135]29), TAF1 (TATA-box binding protein–associated factor 1)
([136]30), and E2F3 ([137]31) TFs]. Meanwhile, when we explored the top
10 ranked TFs on the basis of node betweenness, we recovered critical
TFs of B cell differentiation such as EBF1, TCF3, IRF4, and BCL11A
([138]6, [139]32) and some ETS and KLF family TFs related to B cell
lymphopoiesis, including ERG (ETS-related gene), KLF13, ELK3, and ELK4
([140]33–[141]35).
Aiming to identify the specific master regulators of each cell
subpopulation, we derived a specific GRN for each cell subpopulation
[referred to as a cell type–specific GRN (csGRN); figs. S2B and S6] by
incorporating chromatin accessibility and gene expression profiles
within each cell type. For example, the HSPC-specific GRN included only
interactions where target genes were highly expressed and linked with
accessible OCRs in HSPC samples.
An initial visual inspection of the retrieved csGRNs, projected into a
two-dimensional space on the basis of their pairwise distances,
revealed that they also recapitulated the differentiation trajectories
([142]Fig. 2D). Next, we conducted the topological analysis for each
csGRN by exploring the centrality degree and betweenness measures
([143]Fig. 2E). We observed how the csGRN topology generally mirrored
the regulon specificity shown in [144]Fig. 2A, as those TFs with low
expression or chromatin accessibility in a specific cell state were
excluded from the corresponding csGRN. Among TFs with the highest
values of degree and betweenness, we recovered the primary TFs
implicated in B cell commitment, differentiation, maintenance, or Ig
recombination, which include TCF3, PAX5, IRF4, EBF1, and LEF1
([145]Fig. 2E, in red) ([146]1, [147]2). Moreover, this approach
highlighted the crucial role of IKZF1, SPB1, IRF2, and FLI1 in the
earliest stages of differentiation, reflected by their high degree and
betweenness across CLPs, pro–B cells, and pre–B cells, not previously
observed under the regulon specificity analysis ([148]Fig. 2, A and E).
Focusing on pro–B cell regulons ([149]Fig. 2, A and E; green cluster),
we identified ELK3, PPARA (peroxisome proliferator–activated receptor
α), and BCL11A as putative key regulators in the transition from pro–B
cell to pre–B cell stage. The essential role of BCL11A in lymphoid
development has already been reported ([150]32), while the implications
of PPARA and ELK3 in B cells remain unknown. To note, the ETS-factor
ELK3 has been associated with pro–B cell expansion in mice ([151]35),
defined as a regulon in large and small B cells (FrC′/D—pre–B cells) in
the bursa of Fabricius of young broilers ([152]36), and recently
described in human B cell differentiation ([153]37). Overall, our
topological analysis of csGRNs not only aligned with the functional
cell type specificity of regulons but was also essential in revealing
the master regulators of B cell differentiation, successfully
recovering well-known TFs and highlighting putative candidate
regulators, such as ELK3, that may play a key role in B cell
lymphopoiesis.
Characterization of the ELK3 regulon showed a putative key role in the
differentiation and proliferation regulation in pro–B cells
Integrating the TF regulons and comprehensive GRN analysis, we further
examined the specific regulatory activities of key human B cell TFs
globally. As an example and positive control of the GRN analysis, we
evaluated the EBF1-specific regulon (fig. S7, B to E). EBF1 is a
well-established regulator of B cell differentiation, and among its
predicted upstream regulators, we identified several known B cell TFs,
including PAX5, TCF3, FOXO1, SPIB (Spi-1/PU.1–related), RUNX1, and
CEBPA (fig. S7B), all previously reported to regulate EBF1. In
addition, within the EBF1 downstream-regulated genes, we also confirmed
previously established target TFs such as PAX5, TCF3, and FOXO1 (fig.
S7C) ([154]38). To further unveil the biological functions of the EBF1
regulon, we performed a pathway enrichment analysis on its activated
and repressed target genes. This approach identified a significant
association of the EBF1 regulon–activated genes with B cell activation,
proliferation, and differentiation pathways and the EBF1
regulon–repressed genes with metabolic and homeostatic processes (table
S6). Last, we explored the pathway analysis of the EBF1 regulon for
each csGRN, resolving pathways significantly enriched in the pro–B
cell, pre–B cell, immature B cell, transitional B cell, and naïve B
cell csGRNs (fig. S7, D and E, and table S7). Specifically, this
analysis highlighted the involvement of EBF1 regulon–activated genes in
regulating cellular division in pro–B cells and pre–B cells.
After confirming that our analysis recapitulates the known regulatory
biology of EBF1, we next assessed the role of a candidate TF, ELK3.
ELK3 showed high gene expression levels and TF activity in HSPCs, CLPs,
and pro–B cells ([155]Fig. 3, A and B) and emerged as a regulon within
pro–B cells from our GRN analysis ([156]Fig. 2). We identified 50 TFs
as potential regulators of ELK3, which included ERG, STAT5A/B (signal
transducer and activator of transcription 5A/B), E2F3, ETS2, and ETV6
(ETS variant transcription factor 6) as activators and SPIB
(Spi-1/PU.1–related) and POU2F2 as repressors ([157]Fig. 3C). The ELK3
regulon demonstrated 1339 candidate genes positively regulated and 1294
candidate genes negatively regulated by ELK3. Within the regulated
genes, 52 were TFs, including key regulators of B cell differentiation
such as PAX5, IKZF1, BCL11A, MEF2A, RUNX1, STAT5A, SPI1, and SPIB
([158]Fig. 3D). To unveil the biological functions of the ELK3 regulon,
we performed a pathway enrichment analysis of the activated and
repressed genes. This approach revealed a significant association of
the candidate ELK3 regulon–activated genes with the cell cycle
regulation, including the enrichment of several pathways related to DNA
replication, unsaturated fatty acid metabolism, and MAPK
(mitogen-activated protein kinase) and ERK (extracellular
signal–regulated kinase) cascade (table S8). On the other hand, the
candidate ELK3 regulon–repressed genes showed a significant enrichment
of several B cell activation, proliferation, and differentiation
pathways (table S9). When we assessed the pathway analysis of the ELK3
regulon for each csGRN-resolved pathway, there was a significant
enrichment in the HSPCs and pro–B cells. Specifically, it highlighted
the involvement of ELK3 regulon–activated genes in regulating metabolic
and cellular processes by the enrichment of the fatty acid metabolic
process, PI3K (phosphatidylinositol 3-kinase)/PKB (protein kinase B;
Akt) signal transduction regulation, and MAPK cascade in both HSPCs and
pro–B cells. A significant enrichment of cell cycle pathways was
observed in pro–B cells ([159]Fig. 3E and table S10). The ELK3
regulon–repressed genes showed specific enrichment in pro–B cells,
associated with several B cell activation, proliferation, and
differentiation pathways ([160]Fig. 3F and table S10). Overall, our
findings identify ELK3 as a previously underappreciated TF in B cell
lymphopoiesis, which may play a crucial role in regulating the cell
cycle and advancing B cell differentiation during the pro–B cell stage.
Fig. 3. ELK3 regulon characterization.
[161]Fig. 3.
[162]Open in a new tab
(A) Box plot showing the distribution of ELK3 gene expression and (B)
ELK3 TFBS activity score by cell subpopulation. Significant gene
expression changes between cell subpopulation transitions assessed by
the Wilcoxon test are shown. Significant changes were defined as P <
0.05 (NS., no significant; *P = 0.05, **P = 0.01, ***P = 0.001). (C and
D) Heatmap representation of gene expression for TF regulators of ELK3
and TF regulated by ELK3, respectively. Samples are shown by columns
(grouped by cell subpopulation), and TFs are shown by rows. Normalized
gene expression levels are represented. On the right side, ρ values of
Spearman correlation for each TF versus ELK3 are shown. On top, a bar
plot representation of the ELK3 gene expression is shown. (E)
Representation of the main enriched GO terms derived from the pathway
analysis of the ELK3 regulon–activated genes for each csGRN. The
adjusted P value is shown in a red-blue color range, and the number of
genes enriched within the term (counts) is shown by dot size. (F)
Representation of the main enriched GO terms derived from the pathway
analysis of the ELK3 regulon–repressed genes for each csGRN.
Significant terms only appeared in the assessment of csGRN derived from
pro–B cells. For the pro–B cell csGRN enriched terms, the adjusted P
value (red-blue color range), gene ratio, and counts (dot size) are
shown.
Single-cell translation fine-tunes and validates bulk-derived regulons
dynamics
To gain a finer-grained understanding of the transcriptional regulation
of the early B cell differentiation, we analyzed the bulk-based defined
regulons at the single-cell level. To this end, we interrogated two
single-cell datasets: a public dataset including healthy human bone
marrow cells from hematopoietic stem progenitor cells (HSPCs) to mature
B cells and plasma cells from five donors (phs002371.v7.2; [163]Fig.
4A) ([164]39) and an internally generated dataset including bone marrow
CD34+ hematopoietic progenitors isolated from five healthy donors
[in-house dataset; mean age, 22.4 (21 to 24)] ([165]Fig. 4B; figs. S1B
and S8, A to I; and table S1). This approach allowed us to explore the
specific TF expression and the activity of regulons over the entire B
cell differentiation, including those cellular states not currently
accessible by flow cytometry–based isolation.
Fig. 4. Translation of the early B cell differentiation regulatory landscape
to a single-cell resolution.
[166]Fig. 4.
[167]Open in a new tab
(A) scRNA-seq UMAP dimensionality reduction of 10,529 human bone marrow
cells, recovering the B cell lymphopoiesis (public dataset
phs002371.v7.2). (B) scRNA-seq UMAP dimensionality reduction of 28,891
human CD34+ cells, drawing the early B cell lymphopoiesis (in-house
dataset). (C and D) Heatmap showing, for each cell (columns; grouped by
cell subpopulation), the estimated gene set activity score of each TF
regulon–activated gene set by calculating the AUCell score (rows) for
the public and in-house datasets, respectively. Row-scaled AUCell score
values are represented. TF regulons are clustered and colored according
to TF regulon groups identified in [168]Fig. 2A (left-side colored
bar). (E and F) Dot plot showing, for each cell subpopulation, the
average gene expression and the percentage of expressing cells for
PPARA, ELK3, and NR3C1 (TF regulons identified in [169]Fig. 2 as pro–B
cell–type specifics) and MYBL2, FOXM1, and E2F7 (cell cycle–related
TFs) in the public and in-house datasets, respectively. The average
gene expression is shown in a color range, and the percentage of
expressing cells is indicated by dot size. (G to I) Box plot
representation of the AUCell gene set activity score for the MYBL2,
FOXM1, E2F7, ELK3, NR3C1, and PPARA TF regulon–activated gene sets by
cell subpopulation for public (top) and in-house (bottom) datasets.
We estimated the regulon activity per cell and per regulon by
calculating the AUCell score for each TF regulon–activated gene set.
Briefly, AUCell scores gene set activity in single cells by ranking
expression and calculating the AUC of gene enrichment per cell.
Therefore, we quantified the enrichment of the positively regulated
genes of each regulon within all expressed genes for each cell. In both
datasets, the resulting regulon enrichment patterns closely mirrored
those identified in bulk analysis ([170]Figs. 2A and [171]4, C and D),
showing a moderate-to-strong correlation in more than 79% of regulons
(fig. S9). Despite the similarity, the granularity of the single-cell
data uncovered an increased enrichment of the regulons grouped in the
blue cluster with cycling cells according to one of the main biological
pathways associated with this cluster (cellular proliferation; see
[172]Fig. 2C). Within this group of TFs, we identified MYBL2, FOXM1,
and E2F7 showing elevated gene expression and regulon activity in
cycling cells ([173]Fig. 4, E to G).
In addition, we observed the increased gene expression of ELK3 and
NR3C1 in pro–B cells ([174]Fig. 4, E and F) along with a refined
resolution of their regulon activity profiles ([175]Fig. 4H). The NR3C1
regulon, which was highly active in pro–B cells and pre–B cells
([176]Fig. 2A), showed the highest activity in cycling pre–B cells. In
the case of the ELK3 regulon, which was highly active throughout the
early stages of B cell differentiation—from hematopoietic stem cells
(HSCs) to pro–B cells—and significantly decreased during the transition
from pro–B cells to pre–B cells ([177]Fig. 2A), a delayed reduction in
activity was observed, with the most notable drop occurring between
cycling pre–B cells and pre–B cells ([178]Fig. 4H). PPARA displayed a
similar pattern of regulon activity to ELK3 ([179]Figs. 2, A and E, and
[180]4I); however, its gene expression was slightly detected in CLPs
and pre-pro–B cells and was largely absent in the other cell
subpopulations ([181]Fig. 4, E and F). In summary, the single-cell
translation not only validated the cell type specificity of regulons
but also provided high-resolution insights into their activity dynamics
throughout B cell differentiation.
Identification of relevant B cell subpopulations and regulons within B-ALL
subtypes
Given the comprehensive regulatory landscape of our human B cell
profiling, we next turned our attention to the translatability of our
results within a pathologic context. We hypothesized that the accurate
characterization of the B cell differentiation would provide insights
into the processes underlying the dysregulation of precursor B cell
differentiation–related diseases, such as B-ALL.
To investigate our hypothesis, we obtained a large public dataset
including several B-ALL subtypes from both child and adult individuals
(Li dataset; [182]Fig. 5A and fig. S10, A to C) ([183]40). Next, we
aimed to investigate the distribution of each TF regulon activity
(focused on activated genes and using GSVA) in each individual sample.
Investigating the derived heatmap, we did not identify a perfect
association between regulatory programs and the different genetic
subtypes ([184]Fig. 5B). However, we found consistently increased
activity scores in patients with B-ALL from the TCF3-PBX1 (pre–B cell
leukemia homeobox 1) and MEF2D fusion subtypes within pre–B
cell/immature B cell regulons (pink cluster). Also, within patients
from the ETV6/RUNX1 subtype, a subtle increased activity was observed
for some of the regulons classified as pro–B cell regulons (green
cluster), including OTX1 (orthodenticle homeobox 1), IRX2 (iroquois
homeobox 2), and AIRE (autoimmune regulator) (fig. S11).
Fig. 5. Uncovering B-ALL through the normal B cell differentiation regulatory
landscape.
[185]Fig. 5.
[186]Open in a new tab
(A) Schematic representation of the B-ALL public Li dataset explored,
including 469 individuals (childhood and adult) from three different
cohorts. The PCA of gene expression is shown. Each dot represents a
sample colored by B-ALL genetic subtype. (B) Heatmap representation of
the gene set activity score (GSVA) of each TF regulon–activated gene
set (rows) computed for each B-ALL sample (n = 469; in columns).
Row-scaled GSVA score values are represented. TF regulons are clustered
and colored according to TF regulon groups identified in [187]Fig. 2A
(left-side colored bar). Patients with B-ALL are clustered according to
B-ALL subtype (top colored bar). The sample age (adult versus
pediatric) is shown for each sample (top colored bar). (C) ORA of B-ALL
subtype signatures over the B cell subpopulation signatures. The dot
size represents the −log[10](adjusted P value), and the color range
represents the gene ratio. (D) ORA for RUNX1-ETV6 high-affinity binding
sites (n = 711) over the consensus OCRs for each B cell subpopulation.
The dot size represents the −log[10](adjusted P value), and the color
represents the odds ratio. (E) ORA of B-ALL subtype signatures (rows)
over the TF regulon–activated gene set (columns) defined by the bulk
approach. The dot size represents the −log[10](adjusted P value), and
the color range represents the gene ratio. Only TFs with at least one
significant association are shown. TF regulons clustered and colored
according to TF regulon groups identified in [188]Fig. 2A (top-side
colored bar). (F to I) Box plot showing the distribution of the gene
set activity score (GSVA) computed on the TF regulon–activated gene set
for TCF3, MEF2D, FOXD3, SPI1, OTX1, IRX2, STAT5B, and FOSB TF regulons
by B-ALL subtype. This analysis is shown for the Li dataset and two
additional datasets for validation [Rainer and Iacobucci (Iaco.)], when
possible.
To further investigate those initial observations, we first aimed to
compare what defines transcriptional B-ALL subtypes compared to those
of normal BCPs. To this end, we analyzed the enrichment of the
transcriptional signatures for each of the B-ALL subtypes (fig. S10, D
and E, and table S12) and transcriptional signatures of normal BCPs
(fig. S4) using an overrepresentation analysis (ORA). Our analysis
confirmed a statistically significant association of MEF2D fusions and
TCF3-PBX1 ALL with pre–B cells and immature B cells, ETV6-RUNX1/like
enriched in pro–B cells, and pointed to a significant association of
BCR-ABL1 (Abelson tyrosine-protein kinase 1) with pro–B cells, ZNF348
(zinc finger and BTB domain containing 33) fusions with HSPCs, and the
KMT2A (lysine methyltransferase 2A) fusion subtype with HSPCs/CLPs
([189]Fig. 5C). Moreover, when considering the chromatin accessibility
data, we explored the enrichment of a set of 711 high-affinity binding
sites for the ETV6-RUNX1 fusion defined by Wray et al. ([190]41) in our
chromatin accessibility data, which confirmed again the significant
association of the ETV6-RUNX1 ALL subtype with the pro–B cell state
([191]Fig. 5D).
Once we observed significant associations between B-ALL subtypes and
normal BCPs, we focused on identifying significant associations at the
TF regulon level. To assess this, we calculated the enrichment of the
transcriptional signatures of B-ALL subtypes in the activated target
genes of regulons using the ORA. Our analysis uncovered specific TF
regulons enriched in some of the B-ALL subtypes (see all regulons in
[192]Fig. 5E). Some of the identified regulons were validated using two
independent B-ALL public datasets: one including childhood ALL with
different B-ALL subtypes (Rainer dataset, [193]GSE73578) ([194]42) and
another including adult B-ALL with or without BCR-ABL1 mutation
(Iacobucci dataset, [195]GSE13204) (fig. S10, F to I) ([196]43). Among
those confirmed, we observed the expected elevated activity of the TCF3
regulon in the TCF3-PBX1 subtype and the MEF2D regulon in the MEF2D
fusion subtype, which was also increased in the TCF3-PBX1 subtype
([197]Fig. 5F). In addition, we validated an increased activity of
FOXD3 and SPI1 regulons in both subtypes, among others ([198]Fig. 5G).
Although not significant in the ORA ([199]Fig. 5E), we identified an
increased activity of OTX1 and IRX2 regulons—both associated with the
pro–B cell state; see [200]Fig. 2—in the patients with ETV6/RUNX1 ALL
([201]Fig. 5H). Last, in the BCR-ABL1/ph–like ALL subtype, we detected
a significantly increased activity score for STAT5B and
FOSB—characteristics of early progenitor cells (cluster in
cyan)—regulons with a particularly strong increase in the ZNF384 fusion
subtype ([202]Fig. 5I). In summary, by translating our comprehensive
resource to the B-ALL context, we illustrated their potential to obtain
a deeper understanding of the molecular underpinnings of B cell
malignancies.
DISCUSSION
We addressed the challenge of profiling gene expression and chromatin
accessibility during the early stages of human B cell lymphopoiesis to
elucidate the underlying GRN that orchestrates the differentiation
process. To this end, to our knowledge, we compiled the most
comprehensive sample collection to date, encompassing bulk-derived
multiomics profiles of eight BCPs from 13 healthy donors
(EGAS00001007296), publicly available for browsing at
[203]https://translationalbio.shinyapps.io/brex/. This dataset,
characterized by unprecedented granularity, provided the statistical
power necessary for a detailed characterization of transcriptional and
chromatin accessibility dynamics, allowing us to move beyond the
transcription-based understanding of the gene regulatory landscape.
The definition of the BCP subpopulations has been well established in
mice—the Hardy fractions—([204]44); however, its translation to human
surface markers remains controversial, and an unequivocal assignment is
missing ([205]45). Here, we defined a sorting strategy that allowed the
recovery of the human cell fractions equivalent to HSPCs, CLPs, pro–B
cells (including FrB/C), pre–B cells (including FrC′/D), immature B
cells (FrE), translational B cells (FrD), and naïve B cells (FrD)
([206]Fig. 1, A and H), providing an unparalleled resource to robustly
understand the system and complete what has been previously achieved in
mice or mature stages of human B cells ([207]3, [208]19, [209]46,
[210]47). Although the defined strategy was accurate, we would like to
note that, because of sorting limitations, the CLP fraction likely also
contains early lymphoid progenitor cells (CD34^+CD38^+CD45RA^+CD133^+).
Both gene expression and chromatin accessibility profiles delineate
early B cell differentiation and recovered well-known biomarkers,
biological processes, and key TFs ([211]Fig. 1). Chromatin
accessibility and, more specifically, intronic and distal intergenic
regions yield sharper distinctions between cell subpopulations than
promoter regions or expression profiles (figs. S2A and S12, A and B),
consistent with previous observations in the mouse immune system and
human hematopoiesis ([212]19, [213]48). This increased variability
supports the crucial role of enhancer elements in the B cell
development regulation (fig. S12C).
Given that gene expression is regulated by chromatin accessibility, in
combination with other epigenetic factors ([214]49), we designed a
custom joint analysis—drawing inspiration from a previous study using
mouse bulk data ([215]19)—to reconstruct the GRN underlying early B
cell differentiation (fig. S2). Although the computational approach
involves several assumptions and may yield some false positive edges,
the resulting GRN serves as a robust and valuable resource for
hypothesis generation. This is supported by the following: (i) We used
significant “OCR-gene” pairs derived from a large number of paired
samples; (ii) we used footprinting, enabled by deep sequencing of
ATAC-seq (more than 100 million reads per sample), to improve the
estimation of active TF binding sites; (iii) the GRN recovers the
already well-established regulators; and (iv) a specific GRN for each
cell state can be derived, permitting us to investigate further the
cell type specificity of the regulons ([216]Fig. 2). Moreover, we
leveraged public datasets to confirm that the defined CREs are enriched
by active histone marks ([217]50) and located within A-type
compartments of chromatin conformation (fig. S13, A and B) ([218]51).
Also, using chromatin conformation data from naïve B cells ([219]52),
we validated several CREs associated with characteristic genes from the
late stages of differentiation, including MS4A1 (CD20), CD79A/B, PAX5,
and IKZF1 (fig. S13C). However, Hi-C or promoter Hi-C experiments
covering the early B cell differentiation—currently unavailable—will be
essential for refining the defined GRN and allowing further
interrogation of the cis-regulatory architecture, possibly leveraging
deep learning models. Also, chromatin immunoprecipitation sequencing
(ChIP-seq) experiments can help to validate the regulons identified.
The recovered GRN not only captured well-known regulators extensively
studied in mice but also validated recently described genes
contributing to human lymphoid lineage differentiation, such as MLXIP
([220]53). In addition, it provided a valuable classification of highly
cell type–specific regulons and enabled the identification of master
regulators, shedding light on understanding B cell differentiation and
the transition to malignancy. Our GRN identified ELK3 as a putative
candidate regulatory element of the early stages of B cell
differentiation ([221]Fig. 2, A, C, and E), potentially involved in
cell cycle progression mainly in pro–B cells ([222]Fig. 3). This TF has
recently been defined as a regulon in large and small B cells
(FrC′/D—pre–B cells) in the bursa of Fabricius of young broilers
([223]36), and it was associated with pro–B cell expansion in mice
([224]35). Furthermore, evidence supporting a role for ELK3 in human B
cell differentiation has been reported recently ([225]37). Moreover,
ELK3 has been described as a novel oncogene reported to be increased in
several carcinogenic processes, including the ETV6-RUNX1–positive ALL,
a kind of ALL that resembles the pro–B cell state and shows the
down-regulation of the G[2]-M checkpoint pathway ([226]22,
[227]54–[228]56). ELK3 inhibition has been revealed as a potential
candidate therapy for tumor growth and metastasis in mice and prostate
cancer cells ([229]57–[230]59). Further functional studies are needed
to better understand the regulatory role of ELK3 in both healthy and
disease contexts.
To achieve a higher cellular resolution, which is not feasible with
sorting strategies, we projected the defined bulk-derived regulons onto
single-cell data. Analysis of two independent datasets further
validated and refined the cell type specificity of TF regulons, which
included the regulatory control of ELK3. Notably, MYBL2, FOXM1, and
E2F7 regulons were identified as highly active in cycling cells,
consistent with their known roles in regulating cell proliferation
([231]47, [232]60, [233]61). When we inferred regulons directly from
the multiomics single-cell dataset (fig. S8, A to C, J, and K), we
observed only 62 regulons (table S11), of which 34 were considered
significant (21 activators, ρ > 0.7; 13 repressors, ρ < −0.70). Most of
the activator regulons identified (17 of 21) matched those from our
bulk approach, including HLF, MECOM, MEIS1, NFE2, PBX1, RUNX1/2, CEBPA,
THRB, LYL1, FOS, FOSB, MYBL2, LEF1, TCF4, EBF1, and PAX5. However,
despite the relevant overlap between both approaches, it became clear
that single-cell data did not recapitulate the broader set of regulons
identified through footprinting-based bulk analysis. This comparison
underscores the need for integrative strategies that harness the depth
of bulk data and the resolution of single-cell data to comprehensively
define GRNs.
Last, the translational relevance of our resource was demonstrated by
leveraging the comprehensively defined regulatory landscape of early
human B cell lymphopoiesis to study B-ALL malignancy. This revealed
substantial heterogeneity in regulatory programs across B-ALL subtypes,
with distinct associations between specific early B cell states and
certain genetic subtypes, including MEF2D fusions, TCF3-PBX1, and
ETV6-RUNX1/like. These findings support the concept of heterogeneous
origins of B-ALL transformation and reflect the cell of origin or
cellular state of malignant transformation, as recently described by
Iacobucci et al. ([234]23) ZNF384 fusions were associated with HSPCs,
potentially explaining the preference for allogeneic hematopoietic stem
cell transplantation in this subtype, as well as the enrichment of
HSC-like features in TCF3-ZNF384–positive patients ([235]62, [236]63).
Similarly, KMT2A fusions showed a strong association with HSPCs and
HSPC-associated regulons ([237]Fig. 5, C and E), supporting previously
reported increases in lineage plasticity ([238]39). In addition,
specific regulatory programs were identified and validated for various
B-ALL subtypes, including the well-known increased activity of MEF2D
and TCF3 in MEF2D fusions and TCF3-PBX1 subtypes, respectively
([239]64). We also identified candidate regulatory targets in other
B-ALL subtypes, such as OTX1 and IRX2 in ETV6-RUNX1/like B-ALL. IRX2
has also been described as aberrantly expressed in other B-ALL
subtypes, such as BCR/ABL1 and hyperdiploid karyotypes ([240]65).
Moreover, OTX homeobox genes have been previously associated with B
cell differentiation in Hodgkin lymphoma ([241]66).
Beyond these biological insights, our work introduces a potential
strategy for patient stratification based on active regulatory
programs, either as an alternative to or in combination with genetic
classification. Patient-specific regulon activity profiles could offer
critical insights into treatment response and enable more personalized
therapeutic approaches, supporting the recently uncovered correlation
between developmental states and asparaginase response in B-cell ALL
([242]67). To briefly explore this concept, we performed a preliminary
exploration of regulon activity in relation to treatment response using
two public datasets: one evaluating CART19 (CD19-targeted chimeric
antigen receptor T cell) therapy ([243]68) and another assessing the
response to SMAC (second mitochondria-derived activator of caspase)
mimetics ([244]69). Preliminary results suggest that nonresponders
exhibit higher activity of stemness and proliferative-related regulons
compared to responders (fig. S14).
Last, it is important to acknowledge that, despite the high resolution
achieved and the promising translatability for understanding underlying
mechanisms in malignant contexts, much of the initial analysis relied
on predefined early B cell subtypes. While single-cell mapping improved
the dynamic resolution of regulons, their definition and associated
GRNs remained largely confined to established cell populations.
Consequently, part of the study is still limited to known cell types,
and the existence of additional, yet uncharacterized, subpopulations
cannot be ruled out. In this context, future CITE-seq (cellular
indexing of transcriptomes and epitopes by sequencing) experiments may
provide deeper insights by enabling more refined discrimination and
isolation of cell types ([245]70). In summary, the generation of a
comprehensive robust atlas of regulatory elements governing early human
B cell differentiation provides a powerful resource to deeply
interrogate the GRNs and epigenetic components that distinguish healthy
and disease cell states.
MATERIALS AND METHODS
All data and codes used in the project can be found on GitHub
([246]https://github.com/TranslationalBioinformaticsUnit/bcell) and
Open Science Framework (OSF; [247]https://osf.io/gswpy/).
Human sample collection
Human bone marrow and peripheral blood samples from the same
individuals were collected for 13 healthy donors [7 males and 6
females; median age, 21 (20 to 23)] (table S1). Samples and data from
donors included in the study were provided by the Biobank of the
University of Navarra after informed consent was obtained and were
processed following standard operating procedures. The Clinical
Research Ethics Committee of the Clinica Universidad de Navarra
approved the study (ethical approval 2021.035).
FACS sorting
Mononuclear cells were purified from whole blood and bone marrow
samples using Ficoll-Paque PLUS (GE Healthcare, IL). After that, eight
immune cell populations, representing the lymphoid hematopoietic
differentiation cascade along the B cell lineage from adult human bone
marrow precursors and peripheral blood, were obtained from each sample.
These cells were purified following the sorting strategy shown in table
S1 and fig. S1A in a BD FACSAria II (BD Biosciences) and then analyzed
by bulk RNA-seq and bulk ATAC-seq.
RNA-seq data generation
Bulk RNA-seq was performed following the MARS-seq (massively parallel
RNA single-cell sequencing framework) protocol adapted for bulk RNA-seq
([248]71, [249]72) with minor modifications. Briefly, cells were sorted
in 100 μl of Lysis/Binding Buffer (Ambion), vortexed, and stored at
−80°C until further processing. Poly-A RNA was selected with Dynabeads
Oligo (dT) (Ambion) and reverse transcribed with AffinityScript
Multiple Temperature Reverse Transcriptase (Agilent) using poly-dT
oligos carrying a 7–nucleotide (nt) index. Up to eight samples with
similar overall RNA content were pooled and subjected to linear
amplification by in vitro transcription using the HiScribe T7 High
Yield RNA Synthesis Kit (New England Biolabs). Next, RNA was fragmented
into 250- to 350-nt fragments with RNA Fragmentation Reagents (Ambion)
and dephosphorylated with FastAP (Thermo Fisher Scientific) following
the manufacturer’s instructions. Next, partial Illumina adaptor
sequences were ligated with T4 RNA Ligase 1 (New England Biolabs),
followed by a second reverse transcription. Full Illumina adaptor
sequences were added during final library amplification with KAPA HiFi
DNA Polymerase (Kapa Biosystems). RNA-seq library quality controls
consisted of quantification with a Qubit 3.0 Fluorometer (Life
Technologies) and size profile examination using Agilent’s 4200
TapeStation System. Libraries were sequenced in an Illumina NextSeq 500
in a single end at a sequence depth of 10 million reads per sample.
RNA-seq quantification and normalization
Read quality was assessed with FastQC software (Babraham Institute).
Low-quality, short-read, and TruSeq3-SE adapters were trimmed using
Trimmomatic ([250]73) under the following parameters:
ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15
MINLEN:30. Reads were then mapped to the December 2013 Homo sapiens
high-coverage assembly GRCh38 from the Genome Reference Consortium
GRCh38 (hg38) using STAR (version 2.6.0a) ([251]74) and counted for
each gene using htseq-count ([252]75) with the –m intersection-nonempty
option and a GTF file from GRCh38 Ensembl annotation release 96,
getting the information for a total of 58,884 genes. Genes with a
minimum read count of 1 in all population replicates and an associated
ensemble ID were retained (23,454 genes). A sample identified as an
outlier has been removed (sample ID: D213-4). Counts were normalized
under trimmed mean of M-values (TMM) normalization and voom
transformation from the edgeR r-package (version 3.24.3) ([253]76).
Correlation between samples was computed using Pearson correlation
(fig. S10A).
RNA-seq data analysis
PCA was used for data exploration ([254]Fig. 1B). Differential
expression analysis was performed using limma (version 3.38.3) and
edgeR (version 3.24.3) under two different approaches: (i) First is
identifying biomarker genes for each cell subpopulation (all possible
pair combinations). Genes with an adjusted P value ≤0.01
(Benjamini-Hochberg) and fold change ≥1.5 at least in four contrasts
were defined as significant (B cell subpopulation biomarkers; fig. S4).
(ii) Second is identifying DEGs between transition states (HSC versus
CLP, CLP versus pro–B cell, pro–B cell versus pre–B cell, etc.). Genes
with an adjusted P value ≤0.05 (Benjamini-Hochberg) and absolute fold
change ≥1.5 were considered significant. A clustering analysis was
performed to identify the different expression patterns within the DEGs
over transitions (second approach) using a model-base clustering based
on parameterized finite Gaussian mixture models [mclust r-package
(version 5.4.5) under the following function and parameter: Mclust
(modelName = “EII”)] (fig. S6A) ([255]77). Gene set enrichment analysis
of Gene Ontology (GO) terms was performed using the ClusterProfiler
r-package (version 4.6.0) ([256]78). The fold change of DEGs at
different transitions was given as input to the compareCluster
function, and the top 10 enriched terms were plotted using an in-house
treeplot function ([257]Fig. 1F).
ATAC-seq data generation
Accessible chromatin mapping was performed using FAST-ATAC-seq
([258]48) with minor modifications. Briefly, 10,000 cells were sorted
in 1× phosphate-buffered saline (PBS) and 0.05% bovine serum albumin
(BSA) and pelleted by centrifugation at 500 rcf for 5 min at 4°C with
low acceleration and brake settings in a precooled swinging-bucket
rotor centrifuge. All the supernatant but 5 μl was removed. Next, 25 μl
of transposase mix [15 μl of 2× TD buffer (Illumina), 1 μl of TDE1
(Illumina), 0.25 μl of 5% digitonin (Promega), and 8.75 μl of
nuclease-free water] was added to the cells, and the pellet was
disrupted by pipetting. Transposition reactions were incubated at 37°C
for 30 min in an Eppendorf ThermoMixer with shaking at 450 rpm.
Reactions were stopped at 4°C for 5 min. To release tagmented DNA,
samples were incubated at 40°C for 30 min with 5 μl of cleanup buffer
[900 mM NaCl (Sigma-Aldrich), 30 mM EDTA (Millipore), 2 μl of 5% SDS
(Millipore), and 2 μl of proteinase K (NEB)]. DNA was purified using a
2× SPRI beads cleanup kit (Agencourt AMPure XP, Beckman Coulter). Two
sequential polymerase chain reactions (PCRs) were performed using KAPA
HiFi DNA Polymerase and customized Nextera PCR indexing primers (IDT)
([259]79). The conditions of the first PCR were as follows: 5 min at
72°C and 2 min at 98°C followed by nine cycles of 20 s at 98°C, 30 s at
63°C, and 1 min at 72°C. Depending on the library concentration in this
first PCR, a second PCR (2 min at 98°C followed by four to six cycles
of 20 s at 98°C, 30 s at 63°C, and 1 min at 72°C) was performed, aiming
for a library concentration in the range of 2 to 10 ng/μl. PCR products
were purified using a 2× SPRI beads cleanup kit. Libraries were
quantified, and their size profiles were examined as described above.
Sequencing was carried out in two batches: the first one in an Illumina
NextSeq 500 using paired-end, dual-index sequencing (Read1: 38 cycles;
Read2: 38 cycles; i7: 8 cycles; i5: 8 cycles) at a depth of 10 million
reads per sample and second one in a HiSeq2500, v4 reagents using
paired-end, dual-index sequencing (Read1: 125 cycles; Read2: 125
cycles; i7: 8 cycles; i5: 8 cycles) at a depth of 100 million reads per
sample.
ATAC-seq quantification and normalization
Read quality was assessed with FastQC software (Babraham Institute).
After trimming NexteraPE-PE adaptor sequences and low-quality reads and
cutting the reads from the second batch to 38 base pairs (bp) using
Trimmomatic ([260]73) under the following parameters: CROP:38
ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3
SLIDINGWINDOW:4:15 MINLEN:15. Reads were then mapped to the hg38
reference genome (GRCh38 Ensembl annotation release 96) using BWA-MEM
([261]80). Unmapped, unknown, random, ChrM mapping, duplicate reads and
reads matching the black list [the blacklist includes problematic
regions of the genome ([262]81)] were filtered out using samtools view
-b -F 4 -F 1024 (samtools version 0.1.19) and Picard Tools (Picard
MarkDuplicates; [263]http://broadinstitute.github.io/picard). BAM files
from resequenced samples were merged using samtools merge. Paired-end
reads were used to determine those regions with significant read
coverage in contrast to the random background (regions known as peaks
or OCRs) using MACS2 (function callpeak) with the following parameters:
-g hs -f BAMPE -q 0.01 --nomodel --shift -100 --extsize 200 -B –SPMR
([264]https://github.com/taoliu/MACS) ([265]82). A minimum OCR size of
20 bp was set as a threshold. To get the consensus OCRs for each cell
subpopulation, we found the genome regions that are covered by at least
n − 1 of the sets of peaks (n being the total number of samples for a
specific cell subpopulation), and then we merged the nearby peaks with
a minimum gap width of 31. To get the consensus OCRs in the whole
population, we get the regions in the genome where the coverage is at
least in 1 of the consensus OCRs from the different cell subpopulations
and then merged the nearby peaks with a minimum gap width of 31,
resulting in 105,530 ATAC-seq OCRs. These processes were done using the
rtracklayer (version 1.58.0) and GenomicRanges (version 1.50.2)
r-packages ([266]83, [267]84).
BigWig files were generated for each BAM file using the bamCoverage
function (--binSize 20 --normalizeUsing RPKM --effectiveGenomeSize
2913022398) from deepTools (version 3.3.1) ([268]85). Then, a merged
BigWig for each cell subpopulation was obtained through bigWigMerge
from UCSC
([269]http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/). Also,
BigBed files were generated from consensus BED files of each cell
subpopulation using bedToBigBed from UCSC. These data were deposited in
a public custom track from UCSC
([270]https://genome.ucsc.edu/s/TransBioLab/hg38_Brex).
Using the whole consensus OCRs, we compute the count matrix using the
multicov function from bedtools (version 2.25.0) ([271]86). Those OCRs
without a minimum of 10 counts for at least one sample and 15 total
counts were excluded (155 OCRs), retaining 105,359. Then, we normalize
the data using the TMM method from the edgeR r-package (version 3.30.3)
([272]76). One of the samples analyzed was identified as an outlier and
excluded (sample ID: D215-5). Correlation between samples was computed
using Pearson correlation (fig. S10B).
ATAC-seq data analysis
PCA was used for data exploration ([273]Fig. 1C). OCRs were mapped to
genomic locations (promoter, intronic, distal intergenic, etc.) using
the ChIPseeker r-package (version 1.18.0) ([274]87) under the following
arguments: region range of transcription start site (TSS) as +/− 1000
bp, gene level, “TxDb.Hsapiens.UCSC.hg38.knownGene” TxDb object
(version 3.4.0) and “org.Hs.eg.db” annotation package (version 3.7.0)
(fig. S2A). The Gini index was computed for each OCR to measure
“chromatin inequality” (fig. S10B). A Gini index of 0 represents
perfect equality and similar chromatin accessibility in all cell
subpopulations, while a Gini index of 1 represents perfect inequality,
all the accessibility OCRs come from a specific cell subpopulation, and
the others show no accessibility. To compute the Gini index, the
maximum chromatin accessibility observed from the normalized count
matrix was assigned for each cell subpopulation. Then, the gini
function from the REAT r-package (version 3.0.2) was applied ([275]88).
Differential accessibility analysis was performed to identify
differential accessibility regions between transition states (HSC
versus CLP, CLP versus pro–B cell, pro–B cell versus pre–B cell, etc.)
using a quasilikelihood negative binomial generalized log-linear model
(glmQLFit) from edgeR (version 3.24.3) ([276]76). OCRs with an adjusted
P value ≤0.01 (Benjamini-Hochberg) and absolute fold change ≥2 were
defined as significant. A clustering analysis was performed to identify
the different accessibility patterns within the differentially
accessible regions (DARs) under a model-base clustering based on
parameterized finite Gaussian mixture models [mclust r-package (version
5.4.5) under the following function and parameter:
Mclust(modelName = “EII”)] (fig. S6B) ([277]77). To explore the
accessibility of TFs, we computed the aggregated motif score (TFBS
accessibility score) for each TF using the chromVAR r-package (version
1.5.0) ([278]89) with the default parameters. For plotting, the TFBS
score was used ([279]Fig. 1E and fig. S3).
Gene set variation analysis (GSVA)
We performed the GSVA to compute the enrichment of specific gene sets
within each individual sample. We applied GSVA to explore the
transcriptional proximity of different B cell states between humans and
mice ([280]GSE109125) ([281]Fig. 1H) ([282]19). To this end, we used
the defined biomarker gene sets for each B cell subpopulation in humans
(fig. S4) and computed the GSVA enrichment scores using the algorithm
of Hänzelmann et al. ([283]90) by calling the gsva() function from the
GSVA R package (version 1.50.5). We also used GSVA to assess the gene
set enrichment of TF regulons in each B-ALL sample. We computed the
GSVA of the TF regulons within the datasets of Li, Rainer, Iacobucci,
Singh, and Zinngrebe ([284]Fig. 5, B and F to I, and fig. S12). To
explore significant differences at the TF regulon level regarding
treatment response, we used the GSVA results from the Singh and
Zinngrebe datasets as input and performed differential analysis using
limma (version 3.58.1) and sva (version 3.50.0) to adjust for sample
heterogeneity using surrogate variable analysis ([285]91). TF regulons
with an adjusted P value ≤0.05 (Benjamini-Hochberg) were defined as
significant (fig. S12).
OCR variance component analysis
We used a linear mixed model approach for assessing the proportion of
variance in gene expression explained by promoter and distal enhancer
OCRs’ activity (fig. S10C). For each gene, we fit the following linear
mixed model regression
[MATH: Y=U_promoter+<
mi
mathvariant="normal">U_enhancers+
Epsilon :MATH]
where Y is a vector of zero-centered gene expression values, U_promoter
and U_enhancer are random effects, and Epsilon explains noise
unaccounted by the model. The limix python library was used for fitting
the regression models ([286]https://github.com/limix/limix) ([287]92).
OCRs association with target genes (CREs)
To determine the CREs of a gene (OCR-gene pairs), we follow two
approaches: (i) For promoter OCRs, we directly annotate OCRs to genes
based on TSS distance (<1 kb), and (ii) for intronic or distal
intergenic OCRs, we explored the association between chromatin
accessibility and gene expression. First, we identified all intronic or
distal intergenic OCRs for each expressed gene within a window of ±1 Mb
from the gene’s TSS. Second, the Spearman correlation was computed
across all samples (59 samples; eight cell subpopulations) for each
OCR-gene pair (fig. S2B). Bonferroni correction was used to adjust for
multiple testing. OCR-gene pairs with an adjusted P value <0.05 were
considered significant. A complete list of pairs (62,559) is available
in OSF ([288]https://osf.io/gswpy/).
OCRs association with TFs (trans-regulatory elements)
To annotate OCRs to putative TF binding motifs (TF-OCR pairs), we first
identify the footprints within our OCRs. For each sample, we recognize
the footprints using two algorithms, HINT ([289]93) and Wellington
([290]94), using as input the corresponding sample BAM file and BED
file from the corresponding cell subpopulation. For the footprints
identified by HINT, we filtered out those footprints with less than 20
counts. Then, we get the union of the footprints identified for each
sample through both strategies. Next, we compute the consensus
footprint for each cell subpopulation following a similar strategy as
for the OCR consensus. To get the consensus footprint for each cell
subpopulation, we found the genome regions covered by at least n − 1 of
the sets of regions (n being the total number of samples for a specific
cell subpopulation). Last, to get the consensus footprint in the whole
population, we get the regions in the genome where the coverage is at
least in 1 of the consensus footprint from the different cell
subpopulations, and then those with a width less than 6 bp were
removed, resulting in a total of 380,922 ATAC-seq footprints. These
processes were done using the rtracklayer (version 1.58.0) and
GenomicRanges (version 1.50.2) r-packages ([291]83, [292]84). The
identified footprints were located in 72,771 OCRs (69%).
After identifying the footprints, we want to annotate the putative TF
binding motif to each footprint. To do that, we used the Vierstra
database ([293]95). First, we generate a reduced version of the
Vierstra database, getting those annotations falling within our OCR
regions using TABIX ([294]96). Then, we annotated at the archetype
level (version 1.0) to reduce motif redundancy and overlap, resulting
in 30,258,213 region-motif annotations and 286 motif archetypes. The
overlap between footprints and motif archetypes was explored, defining
a minimum overlap of 90% of archetype width. Last, footprints were
linked to the original OCR to get the OCR-archetype matrix, resulting
in a matrix 105,530 by 282.
To translate archetypes to TFs, we define the following strategy.
First, we compute the aggregated motif score (TFBS accessibility score)
for each archetype using the chromVAR r-package (version 1.5.0)
([295]89) with the default parameters. Then, to identify the TFs most
likely associated with an archetype, we compute, for each archetype,
the Spearman correlation between the TFBS score of the archetype and
the RNA expression of each TF clustered in that archetype. Correction
for multiple comparisons was done using the Benjamini-Hochberg method.
Moreover, the TFBS score variability and the coefficient of variation
of TF gene expression were computed. We determine the TF-archetype
association when (i) significant variation for both the TFBS score and
TF gene expression and significant and positive correlation is observed
(adjusted P value <0.05 and ρ > 0) and (ii) no considerable variation
is observed for the TFBS score (<1Q TFBS variability) or TF gene
expression (<1Q RNA coefficient of variation), independently of the
correlation result, assuming that correlation will not be meaningful
because of the low variability of the variables assessed. Last, we
obtained a list of archetype-TF associations that allowed us to
translate archetypes to TFs and get the OCR-TF matrix (105,530 by 275)
with 60,881 OCRs linked at least to one TF motif.
Generating the GRNs and regulons of B cell differentiation
On the basis of the different associations of OCR-gene pairs
(cis-regulation; 62,559 links) and TF-OCR pairs (trans-regulation;
2,299,328 links), we defined the GRN for early B cell differentiation
(fig. S2B). To do that, from the defined OCR-gene and TF-OCR pairs, we
first keep those interactions derived from the DEGs (differential
expression at least in one transition), their promoter regions, and the
associated enhancer elements with differential accessibility
(differential chromatin accessibility at least in one transition) under
the assumption that those genes and OCRs changing across cell
subpopulations are going to be the ones governing the B cell
differentiation process. Then, having the putative TF-OCR-gene
interactions (n = 1,151,024), the correlation between TF-OCR and
TF-gene was computed using Spearman correlation. To reduce spurious and
meaningless TF-OCR-gene interactions, we excluded from the GRN those
where (i) the TF showed no differential expression (n = 469,073); (ii)
TF-gene association was not significantly correlated (Spearman P value
>0.05; n = 186,837); (and iii) meaningless interactions including (a)
positive correlation for TF-gene and TF-OCR but negative correlation
for TF and gene, (b) negative correlation for TF-gene and TF-OCR but
positive correlation for TF and gene, (c) positive correlation for
TF-gene and TF-OCR but negative correlation for OCR-gene, (d) negative
correlation for TF-gene, TF-OCR, and OCR-gene, and (e) positive
correlation for TF-gene and OCR-gene but negative for TF-OCR
(n = 117,846). Last, 377,268 TF-OCR-gene interactions were retained to
reconstruct the GRN of the B cell differentiation process, involving
169 TFs, 7310 genes, and 16,074 OCRs. By exploring the TF-gene
associations, we defined specific gene sets for each TF, referred to as
regulons (TF regulons).
Specific GRNs for cell subpopulations (csGRN) were computed on the
basis of the chromatin accessibility and gene expression for each cell
subpopulation. For example, for HSCs, the GRN captures those
TF-OCR-gene interactions where genes were highly expressed in that cell
subpopulation according to gene expression patterns (fig. S6A; genes
included in brown clusters), and a median of normalized chromatin
accessibility ≥1 (defining accessible OCR) was observed.
Regulon cell type specificity and functionality analysis
To evaluate the cell type specificity of each predicted regulon, we
calculated the RSSs per cell type and regulon. The RSS algorithm is
based on the JSD, a measure of the similarity between two probability
distributions ([296]25, [297]97). Specifically, we used the binary
enrichment of regulon activity (based on target regions or target
genes) together with the cell assignments to see how specific each
predicted regulon is for each cell type. To compute the binarized
regulon activity, we used the GSVA score instead of the AUCell used in
single-cell data. Then, a threshold of 0.2 was defined for binarizing
the data, resulting in a matrix where, for each sample and regulon, we
know if it is enriched (1) or not (0). On the basis of these binarized
data and the cell type assignment for each sample, we calculated the
RSS using the calculate_rss function from the scFunctions r-package
(version 0.0.0.9000) ([298]https://github.com/FloWuenne/scFunctions/).
The output was a data frame with an RSS score for each regulon-cell
type combination. The RSS was computed twice for each regulon according
to TF-gene behavior, a positive correlation between TF and gene pointed
to an activator role and a negative correlation to a repressor role. A
heatmap/dot plot shows the TF expression of the regulon on a color
scale and cell type specificity (RSS) of the regulon on a size scale
([299]Fig. 2, A and B, and fig. S7A).
To explore the biological functions of the defined regulons, we
performed an ORA of the target genes of each regulon using the enricher
function of the clusterProfiler r-package (version 3.18.1) ([300]78).
The gene ratio and −log[10] from the adjusted P value under the
Benjamini-Hochberg method were computed using GO or Kyoto Encyclopedia
of Genes and Genomes databases (table S4). Top significant and highly
relevant biological pathways were selected for the summary circular
plot ([301]Fig. 2C) generated using the chordDiagram function from the
circlize r-package (version 0.4.15) ([302]98). For the comparison of
the pathway enrichment analysis for a list of gene sets, such as the
target genes of a regulon in each cell state ([303]Fig. 3, E and F, and
fig. S7, C and D), we used the compareCluster function from the
clusterProfiler r-package (version 4.10.0) ([304]78).
Topological property and network distance estimation and visualization
According to the structure of our GRNs, different network properties
were computed using the igraph r-package (version 1.3.5) ([305]99).
Node betweenness (the amount of influence a node has over the flow of
information in a graph through the shortest path calculation between
all pairs of nodes) and degree (the total number of edges adjacent to a
node) were evaluated for the complete set of GRNs with weighted edges
according to Spearman’s coefficient of correlation between TF and gene
([306]Fig. 2E). Moreover, Pearson correlation was computed to compare
the networks between them, and multidimensional scaling was used for
visualization ([307]Fig. 2D).
Single-cell samples (in-house dataset)
A total of five human bone marrow samples was collected in our
institution for the single-cell analysis after informed consent was
obtained. Four samples were CD34+ human bone marrow cells, processed by
a single-cell multiome, and one human bone marrow sample sorted for
hematopoietic lineage (sorting strategy: CD34+ CD38− CD3− CD64− CD56-
CD90+; fig. S1B) was divided to perform both single-cell RNA-seq
(scRNA-seq) and single-cell ATAC-seq (scATAC-seq) analyses.
Single-cell multiome data generation (in-house dataset)
The epigenomic landscape and gene expression of 10,000 individual
nuclei from the bone marrow were examined using Single Cell Multiome
ATAC + Gene Expression (10x Genomics) according to the manufacturer’s
instructions ([308]CG000365 RevC and [309]CG000338 RevF). Briefly,
100,000 CD34+ cells were lysed for 2 min, and then 16,100 nuclei were
transposed in bulk. Next, nuclei were partitioned in gel
bead-in-emulsions (GEMs) in the Chromium X instrument (10x Genomics),
where each nucleus was captured separately and its transposed DNA and
cDNA were uniquely indexed. Preamplified transposed DNA and cDNA were
then used to generate separate ATAC-seq and RNA-seq dual-indexed
libraries. Sequencing was performed in a NextSeq2000 (Illumina) at a
depth of at least 20,000 reads per nucleus for gene expression (Read1:
28; Read2: 90; i7 index: 10; i5 index: 10) and 25,000 reads per nucleus
for chromatin profiling (Read1: 50; Read2: 49; i7 index: 8; i5 index:
24).
scRNA-seq data generation (in-house dataset)
The transcriptome of the hematopoietic cells from bone marrow was
examined using NEXTGEM Single Cell 3’ Reagent Kits v2 (10x Genomics)
according to the manufacturer’s instructions. Between 17,600 and 20,000
cells were loaded at a concentration of 700 to 1000 cells/μl on a
Chromium Controller instrument (10x Genomics) to generate single-cell
GEMs. In this step, each cell was encapsulated with primers containing
a fixed Illumina Read 1 sequence, a cell-identifying 16-nt 10x barcode,
a 10-nt unique molecular identifier (UMI), and a poly-dT sequence. Upon
cell lysis, reverse transcription yielded full-length, barcoded cDNA.
This cDNA was then released from the GEMs, PCR amplified, and purified
with magnetic beads (SPRIselect, Beckman Coulter). Enzymatic
fragmentation and size selection were performed to optimize cDNA size
before library construction. Fragmented cDNA was then end repaired,
A-tailed, and ligated to Illumina adaptors. Final PCR amplification
with barcoded primers allowed sample indexing. Library quality control
and quantification were performed using a Qubit 3.0 Fluorometer (Life
Technologies) and Agilent’s 4200 TapeStation System (Agilent),
respectively. Sequencing was performed in a NextSeq500 (Illumina)
(Read1: 26 cycles; Read2:55 cycles; i7 index: 8 cycles) at an average
depth of 60,226 reads per cell.
scRNA-seq data analysis (in-house dataset)
The bcl (binary base call) sequence file format obtained on the 10x
Genomics Chromium sequencing platform was transformed into fastq format
using mkfastq from Cell Ranger 7.1.0 (for scRNA-seq) ([310]71) or
Cellranger-arc 2.0.2 (for multiome single-cell sequencing). The count
matrix used for downstream analysis was generated using the count
pipeline from Cell Ranger 7.1.0 (for scRNA-seq) or Cellranger-arc 2.0.2
(for multiome single-cell sequencing) with the GRCh38 reference
transcriptome provided by 10x Genomics. Following Seurat guidelines
(version 4.3.0.1) ([311]100), several cell and feature quality
filtering steps were applied to each sample. First, doublets were
removed using scDblFinder (version 1.12.0) ([312]101), and then cells
with >30% of mitochondrial percentage in single-cell multiome or >10%
in scRNA-seq were filtered out. Also, cells with <5% of ribosomal
percentage were filtered out in the scRNA-seq sample. Last, we filtered
out extreme outlier cells or features on the basis of lower (0.05) to
upper (0.95) bound quantiles for “number of features” and “counts per
cell.”
Normalization of the gene expression measurements for each cell was
done by multiplying the total expression by a scale factor (10,000) and
log-transforming it. Feature selection criteria have been used for
downstream analysis based on most variable genes using 2000 top genes.
Feature subspaces were obtained from most variable genes using PCA
using 20 components. Clustering was computed using the Louvain
algorithm over the principal components subspace with a resolution
value of 0.5. For data visualization, uniform manifold approximation
and projection (UMAP) was used, computing it over all PCA components
obtained. Then, a first annotation based on several hematopoietic
biomarkers for stem cells, lymphoid cells (T and B cells), and myeloid
cells (erythroid, monocytic, and granulocytic cells) was done. Those
cell clusters representing undifferentiated cells and lymphoid lineage
within the analyzed CD34+ cells were kept for forward analysis. Next,
to integrate the five single-cell transcriptional profiles, we merged
the samples using Seurat. Merged samples were normalized on the basis
of the gene expression measurements for each cell by multiplying the
total expression by a scale factor (10,000) and log-transforming it.
Regress-out of cell cycle was applied. Feature selection criteria were
used for downstream analysis based on most variable genes using 2000
top genes. Feature subspaces were obtained from most variable genes
using PCA using 20 components. To remove the influence of the dataset
of origin, we applied Harmony ([313]102). Clustering was computed using
the Louvain algorithm over the principal components subspace with a
resolution value of 0.8 based on the clustree approach (version 0.5.0)
([314]103). To annotate cell identity, we checked for B cell
biomarkers, cell cycle phase, and AUCell score (AUCell version 1.20.2)
([315]18) derived from our bulk signatures (fig. S8). We also computed
the AUCell score to assess the gene set activity of TF regulons at a
single-cell resolution and visualized the overall profile with a
heatmap representation scaled by rows ([316]Fig. 4D).
scATAC-seq data generation (in-house dataset)
The scATAC experiments were performed with the Chromium Next GEM Single
Cell ATAC kit version 1.1 from 10x Genomics following the
manufacturer’s instructions. For samples with 100,000 cells or less,
the nucleus isolation is done following the 10x Genomics–demonstrated
protocol for low cell input. Briefly, cells are collected in PBS and
0.04% BSA at 4°C; after that, cells are counted in a cell counter; and
centrifugation is done to concentrate them in 50 μl of PBS and 0.04%
BSA. After that, 45 μl of supernatant is removed, and 45 μl of chilled
and freshly prepared lysis buffer is added. After 5 min of incubation
on ice, 50 μl of wash buffer is added. After that, centrifugation is
done to remove 95 μl of the supernatant without disrupting the pellet.
Additional centrifugation is done with 45 μl of diluted nucleus buffer
to remove the supernatant. Isolated nuclei are resuspended in 7 μl of
chilled diluted nucleus buffer. Two of these microliters are used to
determine the cell concentration using a cell counter. The other 5 μl
is used immediately to generate a bulk transposition reaction following
the manufacturer’s instructions and conditions. In the transposase
reaction, the transposase enters the nuclei and preferentially
fragments the DNA in open regions of the chromatin. In this step,
adapter sequences are added to the ends of the DNA fragments. After
transposition, GEMs are generated in the Chromium controller (10x
Genomics). Upon GEM generation, the gel bead is dissolved.
Oligonucleotides containing an Illumina P5 sequence, a 16-nt 10x
barcode, and a Read1 sequence are released and mixed with DNA fragments
and master mix. After incubation, GEMs are broken, and pooled fractions
are recovered. P7, a sample index, and the Read2 sequence are added
during library construction via PCR. Library quality control and
quantification were performed using a Qubit 3.0 Fluorometer (Life
Technologies), Agilent’s 4200 TapeStation System, and Bioanalyzer 2100
(Agilent). Sequencing was performed using a NextSeq500 (Illumina)
(Read1: 50 cycles; i7 index: 8 cycles; i5 index: 16 cycles; Read2: 50
cycles) to an average depth of 14,900 fragments per cell.
scATAC-seq data analysis (in-house dataset)
The bcl file format obtained on the 10x Genomics Chromium sequencing
platform was, in this case, transformed into fastq format using mkfastq
from Cell Ranger ATAC 1.1.0 (for scRNA-seq) ([317]104) or
Cellranger-arc 2.0.2 (for multiome single-cell sequencing). The count
matrix used for downstream analysis was generated similarly using the
count pipeline from Cell Ranger ATAC 1.1.0 (for scATAC-seq) or
Cellranger-arc 2.0.2 (for multiome single-cell sequencing) with the
GRCh38 reference transcriptome provided by 10X Genomics.
To make bulk ATAC-seq and scATAC-seq comparable, a requantified version
of the scATAC-seq original count matrix was conducted to generate a new
feature × cell raw count matrix with the same OCRs from bulk as the
feature space using the FeatureMatrix function from the Signac package
([318]105). The new scATAC-seq matrix contains recalculated count
values based on scATAC-seq genomic fragments according to the bulk
ATAC-seq feature space. Then, we implemented quality control criteria,
retaining cells with a FRIP (fraction of reads in peaks) value of at
least 50% and a TSS enrichment score above 5. Cells with a
log-transformed number of unique fragments per barcode below 3.8 were
filtered out. Furthermore, doublet cells were identified and excluded
using the Scrublet algorithm with default parameters ([319]106).
Dimensionality reduction was conducted using latent dirichlet
allocation by performing Gibbs sampling for various topic counts (K).
The optimal K value was selected on the basis of the highest consensus
scores from four pycisTopic proposed metrics ([320]107).
Accessibility imputation was carried out using the impute_accessibility
function from the pycisTopic package ([321]17), with a scaling factor
of
[MATH: 106 :MATH]
to normalize accessibility scores across selected cells and regions.
Subsequently, gene activity scores were computed using the
get_gene_activity function from the pycisTopic package, incorporating
region-cell probabilities, distance weights, and upstream/downstream
search spaces ranging from 1 to 100 kb to refine gene activity
assignments. Default values of distance weights and other parameters,
such as gene boundaries and TSS extensions, were used to optimize the
accuracy of the activity predictions.
Label transfer was executed in two phases. First, annotations derived
from scRNA-seq were transferred to the multiome scATAC-seq (activity
scores) samples using annotation transfer methods including Harmony,
BBKNN, Scanorama, and canonical correlation analysis. The accuracy of
these transfers was evaluated against the annotations of the
corresponding cell labels in the scRNA-seq, with Harmony yielding the
highest accuracy across four multiome samples. In the second phase,
this transfer method was used to annotate the multiome samples and the
unpaired scATAC-seq sample. Last, samples were integrated using the
Harmony method ([322]102), facilitating the generation of a common
subspace for visualization.
scATAC-seq regulon analysis (in-house dataset)
To explore enhancer GRNs (eGRNs) from single-cell multiomic data, we
integrated the previously identified scATAC-seq data with guidelines
suggested by the SCENIC+ workflow ([323]17). In line with SCENIC+
recommendations, we identified candidate enhancer regions by binarizing
region-topic probabilities and analyzing DARs. The binarization process
used Otsu’s method, followed by dropout imputation by multiplying
cell-topic and topic-region distributions, as previously established by
cisTopic ([324]107). Normalization of the data was conducted using a
scale factor of 10^4, after which highly variable peaks were
identified. DARs were computed using the Wilcoxon rank-sum test,
maintaining the same ordered labeling of cell types that was
established in the prior scRNA-seq analysis.
To generate eGRNs, we merged scRNA-seq and scATAC-seq data to create a
pseudomultiome dataset. For each cell type annotation, we randomly
selected cells from both datasets. Meta-cells were constructed by
averaging scRNA-seq and scATAC-seq counts for each annotation with
K = 5 cells. Motif enrichment analysis was then performed using the
SCENIC+ pipeline, integrating binarized topics and DARs from
pycisTopic. We applied the cisTarget and DEM (differential enrichment
of motifs) algorithms for motif discovery.
cisTarget identified genomic regions from the SCREEN database (HG38)
using the Cluster-Buster tool ([325]108) and ranked motifs by motif
scores. A minimum overlap of 40% between input regions and reference
motifs was required. Recovery curve-based scoring and normalized
enrichment scores (NESs) were computed for each motif, retaining only
motifs with an NES >3. DEM performed a Wilcoxon test to assess motif
enrichment between foreground and background regions. Motifs with
adjusted P values <0.05 (Bonferroni corrected) and a log[2] fold change
>0.5 were selected.
Cistromes were generated by merging motif-based regions associated with
TFs, providing a comprehensive map of TF-target regulatory
interactions. Last, region-to-gene distances were computed, and
TF-target importances were evaluated using the gradient boosting
machine method, adhering to SCENIC+ guidelines.
scRNA-seq data analysis (public dataset, phs002371.v7.2)
We downloaded the FASTQ files for the phs002371.v7.p2 study from Human
Tumor Atlas Network (HTAN; phs002371.v7.p2) ([326]39). To generate the
count matrices needed in the downstream analysis, we used the count
pipeline from Cell Ranger 7.1. with the GRCh38 reference transcriptome
provided by 10x Genomics. Following Seurat guidelines (version 5.0.1)
([327]109), several cell and feature quality filtering steps were
applied to each sample. First, doublets were removed using scDblFinder
(version 1.12.0) ([328]101), and then following the authors’
preprocess, cells with >15% of mitochondrial percentage, cells with
fewer than 1500 UMIs or greater than 50,000 UMIs, and cells with
hemoglobin detection (>10 counts of hemoglobin-related genes) were
filtered out. Healthy samples (n = 15 samples from five donors) were
merged, and normalization of the gene expression measurements for each
cell was done by multiplying the total expression by a scale factor
(10,000) and log-transforming it. Feature selection criteria have been
used for downstream analysis based on most variable genes using 1500
top genes. Regress-out of counts and mitochondrial percentage was
applied. Feature subspaces were obtained from most variable genes using
PCA using 20 components. Clustering was computed using the Louvain
algorithm over the principal components subspace with a resolution
value of 0.5. For data visualization, UMAP was used, computing it over
all PCA components obtained. Then, a first annotation based on several
hematopoietic biomarkers for stem cells, lymphoid cells (T and B
cells), and myeloid cells (erythroid, monocytic, and granulocytic
cells) was done. Those cell clusters representing undifferentiated
cells and lymphoid lineage within the analyzed cells were kept for
forward analysis. Next, to integrate the 15 single-cell transcriptional
profiles of the subset of cells of interest, the merged samples were
normalized on the basis of the gene expression measurements for each
cell by multiplying the total expression by a scale factor (10,000) and
log-transforming it. Regress-out of cell cycle was applied. Feature
selection criteria were used for downstream analysis based on most
variable genes using 2000 top genes. Feature subspaces were obtained
from most variable genes using PCA using 20 components. To remove the
influence of the dataset of origin, we applied Harmony ([329]102).
Clustering was computed using the Louvain algorithm over the principal
components subspace with a resolution value of 0.8 based on the
clustree approach (version 0.5.0) ([330]103). To annotate cell
identity, we checked for cell biomarkers, cell cycle, and AUCell score
(AUCell version 1.20.2) ([331]18) derived from our bulk signatures
(fig. S8).
Public RNA-seq data from patients with BCP ALL (Li dataset)
Transcriptomic data from patients with BCP ALL (leukemic cells from
bone marrow aspirates) were obtained from the European Genome-phenome
Archive. The RNA-seq data (BAM files) from EGAD00001002151,
EGAD00001002692, and EGAD00001002704 datasets were downloaded after the
corresponding access request. Data from adult and pediatric individuals
belonging to a well-known molecular subgroup ([332]40), including MEF2D
fusions, TCF3-PBX1, ETV6-RUNX1/like, DUX4 fusions, ZNF384 fusions,
BCR-ABL1/Ph–like, hyperdiploidy, and KMT2A fusions were included in the
analysis, resulting in 469 patients ([333]Fig. 5A).
As BAM files were mapped to GRCh37, we first obtained the Fastq files
using the bamtofastq function from bedtools (version 2.25.0) ([334]86).
Next, the read quality was assessed with FastQC software (version
0.11.7) (Babraham Institute). Low-quality, short-read, and TruSeq3-SE
adapters were trimmed using Trimmomatic (version 2.6.0a) ([335]73)
under the following parameters: ILLUMINACLIP:TruSeq3-SE:2:30:10
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75. Reads were then
mapped to the December 2013 H. sapiens high-coverage assembly GRCh38
from the Genome Reference Consortium GRCh38 (hg38) using STAR (version
2.6.0a) ([336]74) and counted for each gene using htseq-count ([337]75)
with the –m intersection-nonempty option and a GTF file from GRCh38
Ensembl annotation release 96, getting the information for a total of
58,884 genes. Genes with a minimum read count of 5 in all samples were
retained (32,416 genes). Next, batch correction was done using the
ComBat_seq function from the sva r-package (version 3.38.0) ([338]91,
[339]110). Counts were normalized under TMM normalization and voom
transformation from the edgeR r-package (version 4.0.2) ([340]76). PCA
was used for data exploration (fig. S10, A to C). Differential
expression analysis of leukemic subtypes was performed using limma.
Genes with an adjusted P value ≤0.01 (Benjamini-Hochberg) and fold
change ≥1.5 in at least six contrasts were defined as significant to
get the B-ALL subtypes biomarkers. The B-ALL subtype biomarkers list
can be found in table S12.
ORA in public BCP ALL data (Li dataset)
The ORA of the gene sets from B-ALL subtype signatures (fig. S9, D and
E, and table S12) was performed in the signatures of the different B
cell populations (fig. S4) using the enricher function of the
clusterProfiler r-package (version 3.18.1) ([341]Fig. 5C) ([342]78).
The gene ratio and −log[10] from the adjusted P value under the
Benjamini-Hochberg method were computed. The ORA was also applied to
explore the enrichment of 711 binding sites defined as the
highest-affinity binding sites for the ETV6-RUNX1 fusion under ChIP-seq
analysis ([343]41) in the consensus OCRs for each cell subpopulation
([344]Fig. 5D). Last, we applied the ORA to explore whether there was a
significant enrichment of the regulons identified along B cell
differentiation in the different B-ALL subtype signatures derived from
the Li dataset ([345]Fig. 5E).
Public transcriptomic data from patients with BCP ALL
Transcriptomic data from patients with BCP ALL (leukemic cells from
bone marrow aspirates) were obtained from different public datasets to
validate the results found under the Li dataset approach. Four
different datasets were obtained from Gene Expression Omnibus (GEO):
Rainer dataset ([346]GSE73578) ([347]42), microarray-based whole-genome
expression profiling of lymphoblasts from 46 patients; Iacobucci
dataset ([348]GSE13204) ([349]43), microarray gene expression profiling
of 144 adult patients with B-ALL and who are B-ALL negative for known
molecular rearrangements; Singh dataset ([350]GSE153670) ([351]68),
RNA-seq data from bone marrow samples taken from patients enrolled on
ENSIGN or ELIANA clinical trials before infusion of CART19 (n = 27);
Zinngrebe dataset ([352]GSE140556) ([353]69), microarray data from
eight SMAC mimetic monotherapy (SM)–sensitive and six SM–insensitive
BCP ALL samples. Microarray processed data downloaded from GEO were
used for the computation of GSVA score for each dataset. In the case of
the Singh dataset, the downloaded raw counts from RNA-seq were
filtered, defining a minimum read count of 5 in all samples per gene
and normalized under TMM normalization and voom transformation from the
edgeR r-package (version 4.0.2) ([354]76). PCA was used for data
exploration (fig. S10, F to I).
Public ChIP-seq histone marks
ChIP-seq data from the ENCODE project ([355]www.encodeproject.org/) for
H3K4me1, H3K4me3, and H3K27ac on B cells, naïve B cells, and common
CD34+ myeloid progenitors were downloaded (table S1). To get the
consensus regions for each cell subpopulation and histone mark, we
found the genome regions that are covered by at least n − 1 of the sets
of peaks (n being the total number of samples for a specific cell
subpopulation), and then we merged the nearby peaks with a minimum gap
width of 31. To get the consensus regions for a histone mark in the
whole population, we get the regions in the genome where the coverage
is at least in 1 of the consensus histone marks from the different cell
subpopulations and then merged the nearby peaks with a minimum gap
width of 31, resulting in 166,295 H3K4me1 regions, 37,952 H3K4me3
regions, and 159,079 H3K27ac regions. These processes were done using
the rtracklayer (version 1.58.0) and GenomicRanges (version 1.50.2)
r-packages ([356]83, [357]84). Then, the overlap between our OCRs and
histone marks was assessed using the findOverlaps function
(minoverlap = 6) of GenomicRanges (fig. S13A).
Public Hi-C data
Hi-C public data of naïve B cells were obtained from EGAD00001006486
([358]51). The valid reads of Hi-C experiments were processed with
TADbit version 1.1 ([359]111). Hi-C data were normalized using the oned
function of the dryhic r-package (version 0.0.0.9100) ([360]112) at a
100-kb resolution to remove known experimental biases, a method that
controls for the presence of abnormal karyotypes in cancer samples.
Topologically associating domains were detected at a 40-kbp resolution
using the TopDom algorithm ([361]113). The significant Hi-C
interactions were called with the Fit-Hi-C tool (version 1.1.3)
([362]114), binned at a 10-kb resolution and with the default
thresholds: P value <0.001 and q-value <0.001.
To study the compartmentalization of the genome in naïve B cells, we
used merged biological replicates, resulting in interaction Hi-C maps
with around 300 million valid reads each. Briefly, normalized
chromosome-wide interaction matrices at a 100-kb resolution were
transformed into Pearson correlation matrices. These correlation
matrices were then used for PCA, for which the first eigenvector (EV)
normally delineates genome segregation. The multimodal distribution of
the EV coefficients from the B cell dataset was modeled as a Gaussian
mixture with three components (k = 3). To estimate the mixture
distribution parameters, an expectation-maximization algorithm using
the normalmixEM function from the mixtools r-package (version 1.2.0)
was applied ([363]115). A Bayesian information criterion was computed
for the specified mixture models of clusters (from 1 to 10) using the
mclustBIC function from the mclust r-package (version 5.4.9) ([364]77).
Three underlying structures were defined: alternative
compartmentalization into A-type (with the most positive EV values),
B-type (with the most negative EV values), and I-type (an
intermediate-valued region with a different distribution) compartments.
Two intersection values were defined at the intersection points between
two compartments and used as standard thresholds to categorize the data
into the three different compartments; that is, the A-type compartment
was defined for EV values >=+0.1388, the I-type compartment was defined
for EV values between +0.1388 and −0.01955, and the B-type compartment
was defined for EV values between <=−0.01955.
Then, we checked the overlap between our OCRs and the compartments
identified using the findOverlaps function (default parameters) of the
GenomicRanges r-package (version 1.50.2) (fig. S13B) ([365]84). When an
OCR overlaps with two compartments, we assigned it to the compartment
with a higher overlap.
Public promoter capture Hi-C interaction data
Detected interactions using the promoter capture Hi-C technology from
Javierre et al. ([366]52) were downloaded. From the promoter capture
Hi-C peak matrix, we select all those interactions with a CHiCAGO score
≥5 in the naive B cell population, resulting in 197,442 interactions.
The coordinates of the interacting regions were translated from GRCh37
to GRCh38 using the liftOver r-package (version 1.18.0). Then, we check
the overlap between our interactions (OCR-gene associations) and their
interactions using the findOverlaps function (minoverlap = 6) of the
GenomicRanges r-package (version 1.50.2) (fig. S13C showing some of the
overlapping interactions) ([367]84).
Web application
The Shiny r-package (version 1.8.0) ([368]116) was used to generate a
user-friendly web interface for browsing called B-rex
([369]https://translationalbio.shinyapps.io/brex/). For scRNA-seq, the
ShinyCell template (version 2.1.0) was used ([370]117). The Shiny app
is hosted on [371]https://www.shinyapps.io/.
Acknowledgments