Abstract
The execution of biological activities inside space‐limited cell nuclei
requires sophisticated organization. Current studies on the 3D genome
focus on chromatin interactions and local structures, e.g.,
topologically associating domains (TADs). In this study, two global
physical properties: DNA density and distance to nuclear periphery
(DisTP), are introduced and a 2D matrix, D^2 plot, is constructed for
mapping genetic and epigenetic markers. Distinct patterns of functional
markers on the D^2 plot, indicating its ability to compartmentalize
functional genome regions, are observed. Furthermore, enrichments of
transcription‐related markers are concatenated into a cross‐species
transcriptional activation model, where the nucleus is divided into
four areas: active, intermediate, repress and histone, and repress and
repeat. Based on the trajectories of the genomic regions on D^2 plot,
the constantly active and newly activated genes are successfully
identified during olfactory sensory neuron maturation. The analysis
reveals that the D^2 plot effectively categorizes functional regions
and provides a universal and transcription‐related measurement for the
3D genome.
Keywords: 3D genome, DNA density, nuclear lamina, single‐cell Hi‐C
__________________________________________________________________
The D^2 algorithm is developed to detect two physical properties, DNA
density and distance to nuclear periphery, from the reconstructed
genome structures. Based on them, D^2 plot is constructed to reveal the
spatial positions of genetic and epigenetic markers. Furthermore, a
cross‐species transcriptional activation model on D^2 plot is
summarized, which is able to reflect expression levels.
graphic file with name ADVS-9-2202149-g005.jpg
1. Introduction
DNA molecules are highly condensed inside cell nuclei to compact into
limited space, with the necessary loci exposed for active
transcription.^[ [36]^1 ^] The rapid development of 3C‐based
technologies offers insight into the sophisticated organization of the
genome, allowing for the discovery of functional nuclear structures,
for example, TADs, loops, and compartments.^[ [37]^2 ^] Nevertheless,
most of these interaction‐derived structures focus on local spatial
organizations, lacking a universal ability to compare distal regions.
Although compartments are genome‐wise structures, they fail to
correlate a given region to a specific biological function without
additional functional data.^[ [38]^2 ^] While some researchers use
interdisciplinary approaches to delineate the chromatin structure,^[
[39]^3 , [40]^4 , [41]^5 ^] their methods continue to focus on
interactions. Therefore, universal measurement for a 3D genome with
functional correlation is required to elucidate the structural basis of
genome function.
DNA density (hereafter referred to as density) and distance to nuclear
periphery (DisTP) are two primary physical properties that are related
to transcriptional activities in genome spatial organization. A
negative correlation between density and transcriptional activity has
been implied by several nuclear structures in previous studies.^[
[42]^6 , [43]^7 , [44]^8 ^] Specifically, the densely packed
heterochromatin is less accessible and transcriptionally repressed,^[
[45]^6 , [46]^9 ^] whereas euchromatin, which is less condensed,
exhibits more active transcription.^[ [47]^6 ^] Several studies have
shown that DisTP is positively correlated with transcriptional
activity.^[ [48]^9 , [49]^10 , [50]^11 , [51]^12 ^] The
lamina‐associated domains (LADs), located at the nuclear periphery, are
believed to repress gene expression.^[ [52]^9 ^] Experimental evidence
has reinforced the idea that reduced proximity to the nuclear lamina is
usually accompanied with higher transcriptional activity, while
increased proximity to the nuclear lamina results in more repressed
transcription.^[ [53]^13 , [54]^14 ^] Previous works have clustered the
genomic segments into groups with different DisTP (1 Mb resolution),
either according to the results of experimentation^[ [55]^11 ^] or
those of simulation.^[ [56]^12 ^] By computing the correlation with
genetic markers, such as histone modifications, they found that the
inner groups were more likely to enrich for active markers compared
with the peripheral groups.
Owing to the lack of appropriate techniques for detecting the physical
properties in genome‐wide scale with high resolution, the extent to
which density or DisTP are connected to transcriptional activities, and
the analysis comprising the properties for the comprehensive
description of genome structure and function are unclear. We developed
an algorithm, D^2, to compute the two properties: density and DisTP,
and integrated them as a 2D matrix, D^2 plot, for mapping functional
markers at the genome‐wide scale. We identified distinct enrichment
patterns of functional markers on the D^2 plot and proposed a
cross‐species transcriptional activation model on D^2 plot. Our study
revealed a novel analysis procedure for correlating nuclear
organization with gene transcriptional activities and provided a new
perspective for the study of the 3D genome structure.
2. Results
2.1. Detection of DNA Density and DisTP
D^2 algorithm computed Density and DisTP at the genome‐wide scale in
high resolution (Figure [57]1A; the Experimental Section). The input of
the D^2 algorithm was a set of DNA particles, which were reconstructed
from the single‐cell Hi‐C contact map by Dip‐C^[ [58]^15 ^] or
NucDynamics^[ [59]^16 ^] (Table [60]S1, Supporting Information; the
Experimental Section). Each particle had its own spatial coordinate and
genomic position, thus linking spatial structure to genomic
sequence. First, we introduced a 3D mesh segmentation to assign the
input DNA particles into different cubes. The cube length was
adjusted so that the average number of particles for each cube was
approximately equal among cells, thereby allowing for intercell
comparison. Density was calculated as the number of particles in one
cube and then smoothed to reduce noise. To obtain DisTP, we first
identified the membrane cubes (cubes located at the nuclear periphery),
and then calculated DisTP for each cube as its distance to the nearest
membrane cubes (Figure [61]S1, Supporting Information). Both properties
were consistent across different reconstructed replicates of a given
individual cell (Figure [62]S2, Supporting Information), indicating
that D^2 was robust against the randomness introduced by the structure
reconstruction algorithm. Therefore, we selected the first
reconstruction of each cell for the following analysis. Serial sections
of density and DisTP of a sample cell were illustrated in
Figure [63]1B.
Figure 1.
Figure 1
[64]Open in a new tab
D^2 detects DNA density and DisTP at a genome‐wide scale.
A) Illustration of the D^2 algorithm. Each yellow dot from the
reconstructed structure represents a 20 kb genomic segment. The line
threading yellow dots indicates they are part of a given chromosome.
The light gray mesh is the 3D segmentation, which is shown in 2D for
clear visualization. In the “define membrane cubes” panel, the gray
cubes are regarded as the membrane cubes as they are not buried. The
intensity of red for each cube in the right top subpanels indicates its
density value, while the intensity of yellow in the right bottom
subpanel indicates its DisTP value. B) Serial sections of a sample
nucleus (GM12878 cell 1) showing density (red) and DisTP (yellow). Only
seven sections in the middle are shown. Each section is 8% of the
entire width. The white patches on the sections denote the empty or
almost empty regions inside the nuclei.
We then validated the results of D^2 with experimental evidence. LADs
were the conspicuous nucleic structures that were mainly located at the
nuclear periphery. We found a significantly high correlation between
the probability of appearing at the nuclear periphery (low DisTP) and
the experimental LAD data, thereby verifying the fitness of DisTP
(Figure [65]2A; Pearson coefficient correlation (PCC): 72.87%, p‐value:
4.8 × 10^−213). Rods of nocturnal animals adopted an inverted radial
organization, in which euchromatin resided at periphery and
heterochromatin preferred the interior.^[ [66]^17 ^] Inverted
distributions between the computed DisTP of rods and that of other cell
types were observed (Figure [67]2B), indicating that our algorithm is
able to capture diverse configurations of DisTP from dramatically
different genome structures. Owing to the lack of experimental
measurement for density, we attempted to validate this through highly
compacted chromocenters located at the inner nuclei of mouse cells but
not in human cells (Figure [68]S3, Supporting Information).^[ [69]^18 ,
[70]^19 ^] We found that high‐density regions preferentially located at
the inner nuclei of mouse cells but this was not the case in human
cells (Figure [71]2C), indicating the feasibility of density detection.
Figure 2.
Figure 2
[72]Open in a new tab
Validation of D^2 algorithm. A) Scatter plot of the experimental LAD
data versus the probability of periphery regions from the D^2
algorithm. Each point denotes a 2 Mb DNA segment of the mESC cell. The
correlation and significance are computed by Pearson correlation (n =
1103). The black line shows the fitting line of linear regression. B)
Correlations of computed DisTP among different cell types. Each scatter
plot includes 50 000 randomly sampled 20 kb genomic segments. Detailed
analysis of adult rods is shown in the Supporting Information.
C) Percentage of high‐density regions for a set of given DisTP. Each
line denotes a cell type. The top 5% of dense regions are regarded as
the high‐density regions.
We then investigated the correlation between density and DisTP. In
human cells, density and DisTP were largely independent of each other
(PCC: −0.019 and −0.103; Figure [73] 3A,B; Table [74]S2, Supporting
Information). By contrast, they were positively correlated in mouse
cells (PCC: 0.278 to 0.673, Figure [75]3A,B; Table [76]S2, Supporting
Information). This interspecies difference indicated that human and
mouse cells adapted different strategies for nuclear spatial
organization.
Figure 3.
Figure 3
[77]Open in a new tab
Strong intrinsic stochasticity of density and DisTP. A) Scatter and
kernel density estimate (KDE) plots of density and DisTP. The gray line
delineates the fitting line of linear regression. The correlations are
computed by Pearson correlation. Detailed information for the
additional 15 cell types is shown in Table [78]S2 of the Supporting
Information. B) Boxplot of PCC between density and DisTP for human and
mouse cells. n indicates the number of cell types. C) 2D histogram and
KDE plots of mean and SD for DisTP. One human and two mouse cell lines
are shown here, and additional 14 cell types are shown in Figure [79]S5
of the Supporting Information. D) Similar to (C) but for density.
Next, we explored the stochasticity of density and DisTP by computing
their standard deviations (SDs; Figure [80]S4, Supporting Information;
the Experimental Section). SDs of density (≈1.5) and DisTP (≈4.7) were
comparable to their means (≈2.2 and ≈6.8, Table [81]S3, Supporting
Information), indicating a strong intrinsic stochasticity that was
consistent with previous observations.^[ [82]^20 ^] In addition,
certain regions with specific density or DisTP had lower standard
deviation (SD). Specifically, for DisTP, the periphery regions showed
lower SD for both human and mouse cells (Figure [83]3C; Figure [84]S5A,
Supporting Information), thereby coinciding with their roles as anchors
for attachment to the nucleus membrane. For density, the lower SD
regions exhibited noticeable cell‐type specificity. For most cell types
(11 out of 17), high‐density regions were lower in SD, while for the
other six mouse neuron cell types, low‐density regions had lower SD
(Figure [85]3; Figures [86]S5B and [87]S6, Supporting Information),
suggesting a cell‐type‐specific anchoring mechanism for density.
2.2. Distinct Patterns of Functional Markers on D^2 Plot Revealed Functional
Genomic Regions
The spatial organization of eukaryotic cell nuclei plays a pivotal role
in regulating the biological activities of DNA molecules.^[ [88]^1 ,
[89]^21 , [90]^22 ^] However, a comprehensive and systematic analysis
of the correlation of density and DisTP with chromatin activities was
lacking. To achieve that, we collected different types of
epigenomic and genomic data from human GM12878 cells and mouse neonatal
forebrain neurons, including DNA methylations, histone modifications,
transcription factor binding sites, genomic repeats, and chromatin
states.
For joint analysis, we constructed a 2D matrix of density (x‐axis) and
DisTP (y‐axis), named D^2 plot (the Experimental Section). D^2 plot was
composed of physical states (the Experimental Section), the rectangular
bins with specific values of density and DisTP (Figure [91]4A; Table
[92]S4, Supporting Information). We only kept the states with enough
genomic segments for statistical significance. Genetic markers were
then mapped to the D^2 plot according to their enrichments on genomic
segments. Enrichment scores of each marker at each physical state were
calculated, compared to the whole genome as the background (the
Experimental Section). The enrichment scores of all markers are shown
in Figure [93]4 and the figures in the Supporting Information.
Figure 4.
Figure 4
[94]Open in a new tab
Enrichment scores of genetic and epigenetic markers on D^2 plot. A) 2D
histogram and KDE plots of density and DisTP in human GM12878 cells and
mouse neonatal forebrain neurons. The top and right subpanels show the
1D histograms of density (top) and DisTP (right). Yellow lines mark the
boundaries to exclude outliers from downstream analysis. The boundaries
of 16 cell types are shown in Table [95]S4 of the Supporting
Information. B) Enrichment scores of H3K4me3 marker of human and mouse
cells on D^2 plot. The enrichment score for the bottom right physical
state in mouse cells is not shown owing to its small number (n < 200)
of genomic segments (Methods section, Supporting Information).
C) Similar to (B) but with markers H3K27me3 and H3K9me3. D) Enrichment
scores of Alu and L1 for human GM12878 cell line. The results for mouse
cell lines are shown in Figure [96]S11 of the Supporting Information.
The right panel illustrates the competitive relationship between Alu
and L1 in both human and mouse cells. E) Enrichment scores of YY1 and
RAD21 for human GM12878 cell line. The enrichment scores of CTCF and
SMC3 are shown in Figure [97]S13 of the Supporting Information. The
right panel illustrates the mostly cooperative but occasionally
independent relationships among CTCF, YY1, and cohesin. Details on the
process of computing enrichment scores are presented in Methods section
of the Supporting Information.
We first examined the distribution of active markers on D^2 plot. We
found that H3K4me3, a well‐established marker for transcriptional
activation,^[ [98]^23 ^] enriched at the center rather than at the
periphery of the nuclei, which was consistent with previous microscopic
observations (Figure [99]4B; Figure [100]S7, Supporting Information).^[
[101]^7 ^] Regions with enriched H3K4me3 distribution showed lower
density (Figure [102]4), which confirmed that transcriptional
regulators around active genes lowered the local DNA density.^[ [103]^6
^] Other active histone markers and chromatin states, and even the
poised or weakly transcribed chromatin states exhibited similar
patterns (Figure [104]S8, Supporting Information). The coexistence of
active and weak transcriptional markers implied that the inner
low‐density areas were not a determining factor for transcriptional
activity, but rather provided a hospitable environment, allowing for
the recruitment of regulating proteins for transcriptional activation,
and thereby reinforcing the previously proposed “permissive role” of
spatial organization.^[ [105]^24 ^]
Next, we focused on the facultative and constitutive heterochromatins
with repressed transcription, as marked by the trimethylation of lysine
27 on histone H3 (H3K27me3) and trimethylation of lysine 9 on histone
H3 (H3K9me3), respectively.^[ [106]^25 ^] On the D^2 plot, H3K27me3 was
more likely to occupy the condensed regions inside the nuclei, while
H3K9me3 showed an obvious preference for nuclear periphery
(Figure [107]4C), as confirmed by the immunofluorescence experiments
(Figures [108]S9 and [109]S10, Supporting Information).^[ [110]^26 ,
[111]^27 ^] We also found that H3K9me3 enriched at inner high‐density
regions in mouse cells but not in human cells, suggesting the
species‐specific features of heterochromatin arrangements.^[ [112]^18 ,
[113]^19 ^]
We then investigated the repetitive elements on the D^2
plot. We found that long repeats, including long terminal repeats and
long interspersed nuclear elements (for example, L1), preferred the
inner high‐density regions or periphery regions (Figure [114]4D; Figure
[115]S11, Supporting Information). Contrarily, the short interspersed
nuclear elements (including Alu, B2, and ID) preferably enriched at
inner low‐density regions (Figure [116]4; Figure [117]S10, Supporting
Information), which was consistent with the FISH analysis performed by
Lu et al., who found that Alu clustered at the nuclear interior, while
L1 clustered at the nuclear periphery.^[ [118]^28 ^] The distributions
of Alu and L1 were significantly negatively correlated in both species
(Figure [119]S12, Supporting Information), indicating a competitive
relationship between the two transposons in spatial organization.
Next, we examined the distribution of the architectural proteins,
including RAD21, SMC3 (both subunits of cohesin), CTCF, and YY1. They
are believed to cooperate in the construction of genomic structures via
the use of a previously proposed loop extrusion model.^[ [120]^29 ^]
The patterns of these proteins largely overlapped on the D^2 plot at
the inner region of the nuclei (Figure [121]4E; Figure [122]S13,
Supporting Information). Nonetheless, YY1 and CTCF were more likely to
enrich at the lower‐density regions than cohesin, indicating that YY1
and CTCF preferred to locate at the active compartment. Consistently,
the fold changes of YY1 (1.71) and CTCF (1.23) enrichments in
compartment A to B were higher than those of RAD21 (1.13) and SMC3
(1.15) (Figure [123]S14, Supporting Information). The differences in
enrichments implied that, in addition to working together, these
proteins might function individually, as previously observed by Wei
et al.^[ [124]^30 ^]
2.3. Cross‐Species Transcriptional Activation Model on D^2 Plot
To link transcription activity to the D^2 plot, we selected 16
transcription‐related markers and performed the hierarchy cluster on
physical states (the Experimental Section). As shown in Figure [125]5A,
we successfully distinguished the active states from the repressed
ones. However, there were still intermediate states marked by both
active markers and repressive histone modifications, corresponding to
the “bivalent” genomic regions where promoters were prepared for rapid
activation during differentiation.^[ [126]^31 ^]
Figure 5.
Figure 5
[127]Open in a new tab
Cross‐species transcriptional activation model on D^2 plot.
A) Hierarchy clustering of D^2 plot physical states (n = 225 for human,
n = 224 for mouse due to one state with fewer than 200 genomic
segments) based on the 16 transcription related markers. Using the
optimal ranking hierarchy cluster, the states are ranked automatically
according to their activation levels. The number of clusters is set as
4 for both species. The black dashed lines mark the boundaries of the
clusters. The abbreviations have been spelled out in the legends of
Figures [128]S7 and [129]S10 of the Supporting Information. B) The four
areas on each D^2 plot for either human GM12878 cells or mouse neonatal
neurons, corresponding to (A). C) Illustration of the transcriptional
activation model on the D^2 plot. The arrows point to more active
regions. D) The enrichment scores of newly activated genes on the D^2
plots of oligodendrocyte progenitors and mature oligodendrocytes. The
enrichment scores of blank physical states are not shown owing to the
small number (n < 200) of genomic bins. E) Similar to (D) but with
newly repressed genes.
Based on the hierarchy clusters for both human and mouse cells, we
divided the physical states into four groups: active, intermediate,
repress and histone (repressed states marked by repressed histone
modifications), and repress and repeat (repressed states marked by long
repeats) areas. Then we assigned these four areas back to the D^2 plot
to reveal their transcriptional patterns (Figure [130]5B). The patterns
between human GM12878 cells and mouse neonatal forebrain neurons showed
slight differences. For example, by computing the property strength
(Figure [131]S15, Supporting Information; the Experimental Section), we
found that DisTP had less impact on the nucleic interior in human cells
compared to mouse cells, probably owing to the different strategies of
heterochromatin organization between the two species.^[ [132]^18 ^]
Other than the marginal difference, the patterns were highly similar,
out of which we proposed a cross‐species transcriptional activation
model on the D^2 plot (Figure [133]5C). The model showed that the most
inactive area, repress and repetitive, located at the periphery
regardless of density. At the inner nucleus, however, density and DisTP
jointly affected gene regulation. From inactive to active areas
(repress and histone, intermediate to active area), density decreased
while DisTP increased. This model is largely consistent with the
current understanding of density and DisTP, but provides additional
functional annotation.
To verify the proposed model, we projected the lineage‐specific active
genes of mouse brain cells onto the D^2 plot and found that those
active genes located at more interior regions with lower density in the
corresponding cell types compared with other cell types (Figure
[134]S16, Supporting Information). Next, we focused on the genes that
were previously defined as either newly activated or repressed during
maturation from oligodendrocyte progenitors to mature
oligodendrocytes.^[ [135]^32 ^] As expected, the newly activated genes
distributed almost randomly on the D^2 plot of progenitors, but
enriched at inner low‐density regions after the maturation
(Figure [136]5D), corresponding to the active and intermediate areas.
Conversely, the newly repressed genes located at inner low‐density
regions in progenitors but distributed randomly in mature cells
(Figure [137]5E).
Tan et al. found no explanation for the adult‐neuron‐specific genes
maintaining the same radial position during differentiation, despite
the newly activation of these genes.^[ [138]^32 ^] We revealed that
DisTP values of these genes remained unchanged (Fisher Test p‐value:
0.68), but density values declined significantly (Fisher Test p‐value:
3.34 × 10^−148; Figure [139]S17, Supporting Information), demonstrating
that although DisTP alone was not sufficient for predicting gene
regulation, its combination with DNA density was instructive.
2.4. Trajectories on D^2 Plots Revealed Transcriptional Activation Modes
After establishing the correlation between the transcriptional level
and the D^2 plot, we aimed to identify the transcriptional patterns of
genomic regions merely by their trajectories on the D^2 plot. To do
this, the activation index was introduced as the probability of a given
genomic region appearing at the active or intermediate areas, and
changes in its value reflected the trajectories on the D^2 plot during
cell maturation (Figure [140]S18, Supporting Information; the
Experimental Section). We examined our approach on olfactory sensory
neurons (OSNs), which included three cell types at different
differentiating stages (newborn, developing, and mature OSNs). By
investigating the trajectories of the whole genome on the D^2 plot, we
identified three transcriptional modes: constantly active (CA),
repressed to active (RA), and constantly repressed (CR), which revealed
the dynamic regulation of gene expression during OSN development.
Next, we investigated the enrichments of functional elements in the
three modes. Compared to the background, the percentages of exons in CA
and RA were remarkably higher, while that of CR was lower (Figure
[141]6A), in accordance with their expression levels. In stark
contrast, the percentages of L1 in CA and RA were lower while that of
CR was higher. As expected, CA was enriched with foundational gene
pathways (ribosome, translation, and lipid transport), whereas RA was
enriched with olfactory receptor (OR)‐related pathways
(Figure [142]6B). No significantly enriched pathways (FDR < 0.01) were
identified for CR.
Figure 6.
Figure 6
[143]Open in a new tab
Trajectories on D^2 plots reveal the transcriptional activation modes.
A) Boxplots for percentages of exon (left) and L1 (right) in three
identified modes, namely constantly active (CA), repressed to active
(RA), and constantly repressed (CR). For background datasets, 20 000
bins from the whole genome are randomly selected. B) Enriched GO terms
and KEGG pathways of the three modes. Only the top five significant
terms (FDR < 0.05) are plotted. No significant term is found for the
CR, therefore it is not shown. C) Schematic of a 3 Mb genomic segment
of paternal haplotype containing 19 HK genes, 10 ORs, and 129 L1
repeats. The colored segments mark the locations of these functional
elements. D) The top three heatmaps show the activation indexes of
three OSN cell types. The three identified modes are shown beneath
them. The gray bins are not classified into the three modes. The bottom
three bar plots denote the HK gene density, L1 density, and OR density
at the 3 Mb genomic region. Each bin at the heatmap or bar plot
indicates a 20 kb genomic segment. E) The trajectories of the sample
genomic segments on the D^2 plot. The trajectories for other genomic
segments are shown in Figure [144]S20 of the Supporting Information.
The black arrows show the directions of trajectories for ORs.
We then spotted an ≈3 Mb genomic segment (chromosome 10: 126.16–129.28
Mb), harboring 19 housekeeping (HK) genes (mainly located at CA), 10
ORs (at RA), and 129 L1 repeats (at CR) (Figure [145]6C). The
activation indexes of HK genes stabilized at high levels across the
three cell types, while those of L1 repeats remained at low levels
(Figure [146]6D). On the D^2 plot, the housekeeping genes persistently
enriched at the inner low‐density regions, while the L1 repeats
enriched at the periphery (Figure [147]6E; Figure [148]S19, Supporting
Information), corresponding to their constitutive activation or
repression, respectively.
ORs were a large gene family that expressed at mature OSNs for sensing
odors.^[ [149]^33 ^] Correspondingly, their activation indexes
increased during development (Figure [150]6D), and they moved on the
D^2 plot from the periphery (newborn OSNs), to inner high‐density
(developing OSNs), and then toward lower density (mature OSNs), in
consonance with their activation process (Figure [151]6E; Figure
[152]S18, Supporting Information). Additionally, for each individual
OSN, only one specific OR expressed while all the others were
repressed,^[ [153]^33 ^] which would leave featured distributions on
D^2 plot. Our D^2 plot demonstrated that ORs collectively enriched
inside nuclei with varying densities, corresponding to the active,
intermediate, and repress and histone areas (Figure [154]S20,
Supporting Information), and reflecting their diverse levels of
activation at the bulk level.
3. Conclusion
Inside the cell nuclei, the genome is compartmentalized into regions
with different physical properties, to effectively execute their
biological functions. Based on the reconstructed genome structures from
single‐cell Hi‐C, we developed the D^2 algorithm to define density and
DisTP at a genome‐wide scale and integrated them into a two‐dimensional
D^2 plot. Different genetic and epigenetic markers exhibited distinct
enrichment patterns on the D^2 plot, indicating that these two physical
properties, density and DisTP, were two pivotal parameters for
identifying the functional compartmentalization of the genome
structure. For instance, transcription‐related enzymes and active
genomic sites altogether clustered at inner low‐density regions,
whereas repressed heterochromatin markers enriched at inner
high‐density or periphery regions.
In this study, we investigated the enrichments of architectural
proteins (CTCF, YY1, and cohesin) on the D^2 plot and revealed their
largely overlapping enrichment zones, correlating to their cooperative
functions. In addition, we spotted cohesin‐independent enrichments of
YY1 and CTCF at inner low‐density regions, hinting their independent
functions. Furthermore, all the architectural proteins located inside
the nuclei, suggesting alternative mechanisms for genome construction
at the nuclear periphery. L1 repeats, which were previously reported to
promote heterochromatin compartmentalization,^[ [155]^28 ^] were found
to mainly locate at the nucleic periphery, implying their possible
involvement in maintaining genome construction.
In this study, we proposed a cross‐species transcriptional activation
model on the D^2 plot among different human and mouse cells. This model
revealed that the distance to the periphery was negatively correlated
with the transcriptional activities of genes, which seemingly
contrasted to the center‐positioned nucleoli that were mainly composed
of heterochromatin. The significantly low genome coverage (<5%) of
nucleolus‐associated domains might render these regions neglected in
our model.^[ [156]^34 ^] To examine the fitness of our model to reflect
gene activation, mouse olfactory neurons at three differentiating
stages were analyzed, and our model successfully identified the
activated genes based on the change of density and DisTP. However, the
underlying mechanism has not been explored. We assume that the
translocation of the selected genes between distal regions with
different density and DisTP is unfavorable, owing to the high cost of
energy. Alternatively, the changes of local chromatin modifications,
such as histone modification, assembling into distinct condensates
through phase separation, and simultaneously altering surrounding
density or DisTP, is feasible.
Notably, the D^2 model of mouse cells, albeit being derived from mouse
forebrain neurons, is capable of linking the structure with gene
expression during OSN development, verifying its broad fitness and
feasibility. Nonetheless, further verification and application of this
model in other species is required, but these are currently unavailable
owing to the lack of high‐quality single‐cell Hi‐C data. Our model
would be applicable to most species, where sophisticated Mb‐scale
spatial organization is essential for regulating gene expression. It
might be helpful to elucidate human tumorigenesis analysis, which
usually involves dramatic alteration of the genome structure.
Our study provides a new perspective for 3D genome analysis. Previous
studies mainly focused on well‐known structures such as loops or TADs,
which only revealed the spatial adjacency of focal structures.
Contrarily, our D^2 model allows for universal measurements of the
genome structure and correlates the genomic structure with its
function. Nonetheless, we speculate that physical properties of DNA act
more as microenvironments to facilitate or repress gene transcription
activity rather than direct and decisive factors, as protein structure
to its function. In summary, our study paves the way for further
methodological exploration to link genome structure with its function.
4. Experimental Section
D^2 for Detecting Density and DisTP
Python script D2.py implemented the D^2 algorithm, which rapidly and
accurately computed the DNA density and DisTP values based on the
reconstructed genome structure. D^2 algorithm can be found at
[157]https://github.com/xjtu‐omics/D2.
Inputted Reconstructed Structures
Reconstructed genome structures, which were generated from single‐cell
Hi‐C, were the main input for the D^2 algorithm. Genome structures from
two research groups were collected in this study (Table [158]S1,
Supporting Information). Stevens et al.^[ [159]^16 ^] processed
single‐cell Hi‐C on eight individual haploid mouse ES cells and then
reconstructed the nuclear structures, using the NucDynamics algorithm.
The algorithm was designed based on molecular dynamic simulation.
First, the genome was divided into 100 kb genomic bins, each of which
represented a DNA particle in 3D structure. Then, a structure was
constructed using the physical force field, including attraction,
repel, and restriction. Using the simulated annealing algorithm, the
structures were constructed to optimally simulate actual structures.
The other group, Tan et al. developed a novel chromain conformation
capture method, Dip‐C, which can detect more contacts than single‐cell
Hi‐C. They omitted biotin pull down and conducted high‐coverage
whole‐genome amplification (META). Then, they used a genome
reconstruction algorithm, hickit, which was similar to NucDynamics and
enabled the construction of diploid cells at a relatively higher
resolution (20 kb).^[ [160]^10 ^]
Partition Particles by 3D Segmentation
Owing to the heterogeneity of nuclei sizes among cells, the
reconstructed structures were not directly comparable among cells. This
batch effect resulted in different nucleic radii and DNA densities. To
address this, the DNA particles were placed inside a 3D mesh
segmentation with equally shaped cubes, with cell‐specific cube length.
The length was set as the average distance to the nearest ith particle
of all particles from a given cell (i is the predetermined parameter,
default as 7), to ensure the average number of DNA particles within a
cube was approximately the same across cells. This predetermined
parameter was not sensitive for further analysis, because the following
analysis concentrated on the differences between the density and DisTP
of genomic regions (e.g., lower‐density regions), while the change in
this parameter increased or decreased density or DisTP universally. The
segmentation was constructed according to the cube length and covered
the particles. Then the spatial coordinate of particles was replaced
using the cube coordinate. This normalization was based purely on DNA
density, but the nucleic radius (half of the maximum distance among
filled cubes) was adequately normalized (Figure [161]S21, Supporting
Information). Furthermore, this partition reduced the number of tasks
from particles to cubes, thus efficiently speeding up and saving memory
during calculation.
Computing DNA Density
Once segmentation was performed, DNA density could simply be defined as
the number of particles per cube. However, the arbitrary partition
above could result in an inaccurate density. For example, if a DNA
particle resided in the transiting regions between compacted and loose
regions, it should be labeled as middle density. However, it may be
partitioned into high‐density or low‐density cubes, instead of a middle
one. To solve this problem, the density of a given cube was smoothened
by its nearby cubes through the following equation
[MATH: Di=∑j∈SCj/∥i⃗−j⃗∥numS
:MATH]
(1)
where D[i] is smoothed DNA density of the spatial cube i, C[i] is the
DNA particle number within the cube i, and S is the set of nearby
cubes. The nearby range was set as 3. A longer range would be more
optimized for calculation, but it was unnecessary as distant cubes
contributed little.
Computing Distance to Periphery
Previous methods for detecting DisTP were simple and, to a certain
extent, inaccurate (Note [162]S2, Supporting Information). One of them
was developed in 2017 by Stevens et al.^[ [163]^16 ^] It first
identified the empty cubes around the nonempty cubes as the surface.
Then, they calculated the DisTP as the nearest distance to the surface.
However, imaging experiments showed that chromocenters would repel DNA
inside the nucleus,^[ [164]^10 ^] which would be wrongly defined as the
nuclear surface in Steven's methods. Tan et al. used another metric,
the distance to the nuclear center.^[ [165]^10 ^] This circumvented the
chromocenter, but assumed the nuclei to be spherical. However, nuclei
were always in different shapes. Therefore, a new methodology that
accurately defined the periphery and then computed the DisTP was
introduced.
In this method (Figure [166]S1, Supporting Information), the outer
empty cubes were first defined as the membrane cubes. They were started
to define from the boundaries of cubes, and then were moved toward the
inner cubes. The definition was stopped when filled cubes were
encountered. To eliminate the dissociative outlier particles, the
definition was stopped only when there were enough particles within a
certain range (3 particles in 5 cubes by default). In other words, the
faraway cubes with only one or two particles were defined as membrane
cubes. Then, the DisTP for a given cube was calculated as the average
value of the top 10 minimum distances to membrane cubes. The minimum
distances were not used because they were discrete values and highly
sensitive. For example, if one inner cube was incorrectly defined as a
membrane cube, the cubes around it would be defined as periphery.
D^2 Plot: Construction of D^2 Plot
To combine density and DisTP for further systematic analysis, a 2D
density–DisTP matrix, the D^2 plot was constructed. Outlier bins were
eliminated during plotting. For the density axis, the distribution
resembled the normal distribution. Therefore, the 1.96 * SD was
selected from the center as the separating line, to retain 95% of the
center bins. For the DisTP axis, all the bins were maintained near the
periphery, but eliminated the top 5% bins further away from the
periphery. The ranges were similar among the diploid cell types, even
among the species (Table [167]S4, Supporting Information). Therefore, a
standard range was set for all diploid cells (1 to 3.2 for density,
1.21 to 16 for DisTP). Each axis was then divided into 15 equally
shaped bins. The bin number was set arbitrarily, but it was showed that
different bin numbers did not affect the enrichment patterns (Figure
[168]S22, Supporting Information). Each 2D bin at the D^2 plot
represented a specific physical state with a unique combination of
density and DisTP. Notably, states with fewer than 200 particles were
removed from subsequent analysis to ensure statistical significance
(marked as empty).
Computing Marker Enrichment Scores
To calculate the marker enrichment scores on the D^2 plot, genetic data
were weighted and averaged by the probability of the genomic segment
appearing at the physical state, as shown below
[MATH: Ea=∑GPga·NVg∑GPga :MATH]
(2)
where E[a] is the computed enrichment score of the genomic marker in
the physical state, a. NV[g] is the z‐scored normalized enrichment
score of a genetic marker at the genome segment, g. P[ga] is the
probability of the genome segment g appearing in the state a, defined
as the ratio of the number of cells whose genomic segment g located at
the state a with respect to the total cell number. Afterward, the
enrichment scores were normalized by z‐score normalization.
Through this procedure, the density and DisTP of single cells were
determined. There was another way of computing the enrichment scores,
where the density and DisTP were averaged first
[MATH: Ea′=∑GaNVgNGa :MATH]
(3)
where
[MATH: Ea′
:MATH]
is the alternative enrichment score derived from the average density
and DisTP at physical state, a. G[a] is the collection of genomic
segments whose average density and DisTP were located at state a.
N(G[a]) is the number of collections. The average values delineated the
bulk activities of a cell type more optimally, but were unable to
represent the raw density and DisTP detected from the single cell. For
example, the average values could not reflect the sparse distribution,
for example, the multimodal distribution.
To test if the average density and DisTP would give similar enrichment
patterns, the dip test was first carried out for unimodal distribution.
All genomic bins were unimodal for both density and DisTP (Table
[169]S3, Supporting Information, percentage of both p‐value < 0.01:
98.11–99.88%). Then, the enrichment score analysis with average density
and DisTP was repeated and it was found that it exhibited similar
patterns to Figure [170]4 (Figure [171]S23, Supporting Information),
indicating that the average density and DisTP equivalently reflect
biological activity.
Stochasticity of Marker Enrichment Score
Considering the high cell stochasticity of density and DisTP, the
stochasticity of marker enrichment scores was examined by computing
their SD using the following equation
[MATH: SDa=∑GPga·<
/mo>NVg−Ea2∑GPga :MATH]
(4)
where SD[ a ] is the calculated SD of the marker enrichment score at
physical state a. NV[g] , P[ga] , and G are defined as above. To
compare SD with the mean enrichment scores, they were presented
together at Figure [172]S24 of the Supporting Information. The max–min
normalization on mean values was utilized for comparison between
markers. This normalization was also used on SD values. Therefore,
markers with SD less than 0 denoted low stochasticity, while others
with greater than 1 denoted high stochasticity. Certain markers (e.g.,
H3K27me3 and H3K9me3) showed lower SD than the mean value, indicating
their consistent enrichment among cells. Nonetheless, most markers
showed a comparable or larger SD, indicating high stochasticity. To
test the robustness of the enrichment score analysis under such
stochasticity, the analysis was repeated using only one cell (Figure
[173]S25, Supporting Information). The enrichment patterns, even from
one cell, were highly similar to Figure [174]3, suggesting a variable
despite the robust correlation between physical properties and genetic
markers.
Hierarchy Cluster and Property Strength
To obtain a transcriptional pattern on the D^2 plot, 16
transcription‐related markers (Txn‐St, Txn‐W, Pr‐Ac, Pr‐W, En‐St, En‐W,
H3K27ac, H3K36me3, H3K4me3, H3K4me2, Alu, H3K9mne3, H3K27me3, L1, Ht‐L;
the abbreviations are spelled out in the legend of Figures [175]S7 and
[176]S10, Supporting Information) were selected and the hierarchy
cluster (scipy.cluster.hierarchy with parameter optimal_ordering =
True, method = “ward”) was performed on physical states from the D^2
plot. The hierarchy cluster number was set as 4 for both species.
The property strength reflected the intensity of impact of density or
DisTP on the transcription level. The active markers were selected and
the property strength was computed at a given DisTP as the slope
between property changes (x) and marker enrichment score (y), as
follows
[MATH: Sd,p=slopeof<
/mi>linearfittingx=MFGd,p,y=Pd,p :MATH]
(5)
where S [d,p] is the computed property strength of property p at DisTP
d. For density, the x input for linear fitting is the collection of all
densities at a given DisTP and y is the corresponding marker enrichment
score. For DisTP, x is the collection of the given DisTP and its nearby
DisTP, and y is the corresponding enrichment score. The property
strength reflected the association between the physical properties and
the transcription level.
Distribution of Lineage‐Specific Active Genes
The distribution of active genes was examined on the D^2 plot, to
validate the transcriptional activation model. Considering the
nonuniform distribution of whole genome bins on the D^2 plot, the
enrichment scores of active genes over the background were computed.
Specifically, the distribution of active genes was first calculated,
and then it was divided by the distribution of all genomic bins as the
background.
Trajectories on D^2 Plots
The activation index and its change were introduced to reveal the
trajectories of a given genomic region on the D^2 plot, which reflected
its transcriptional activation modes (Figure [177]S18, Supporting
Information). Specifically, the activation index was defined as the sum
of the enrichment scores of a given genomic bin on active or
intermediate regions. Based on the activation index, the top 10%
genomic bins were marked as the active bins, the bottom 10% as the
repressed bins, and the others as medium. Three transcriptional modes:
constantly active (CA), constantly repressed (CR), repressed to active
(RA), were identified. During OSN maturation, the bins, which were
active at all OSN cell stages, were marked as the CA bins. Conversely,
the bins repressed at all stages were marked as CR bins. RA bins
comprised bins that were not active at newborn OSNs, but active at
mature OSNs.
The genomic positions of exons were obtained from the gff file, and
those of L1s were obtained from RepeatMasker. Housekeeping genes were
obtained from the HRT Altas v1.1 database.^[ [178]^35 ^] Their
percentage of length at the given genomic bins was calculated.
The gene pathway enrichment analysis was performed by DAVID.^[ [179]^36
^] Only KEGG pathways and GO terms were chosen for enrichment.
Reference Genome
The mouse reference genome (GRCm38) and gene annotations (ALL) were
downloaded from the GENCODE ([180]https://www.gencodegenes.org/mouse/)
M19 release.
The human reference genome (GRCh37) and gene annotations (ALL) were
downloaded from the GENCODE ([181]https://www.gencodegenes.org/human/)
39 release.
Statistical Analysis
All of the Pearson correlations were two‐sided and were computed using
scipy.stats.pearsonr. The linear fitting was conducted using
scipy.stats.linregress, with parameter deg = 1. The sample size for
each statistical analysis is available in the figure legends or in the
Supporting Information. Fisher tests were two‐sided and computed by
scipy.stats.fisher_exact.
Preprocess of Genetic Markers
To integrate density and DisTP with other genomic data, the other data
were transferred into the same indexed data. All data (except
DamID‐LmnB1 data of mESC) were indexed by 20 kb genomic bins. For
unphased data, the data were copied to two haplotypes.
ChIP‐seq Data
ChIP‐seq sequenced protein binding sites. ChIP‐seq data of
transcription factors (CTCF, YY1, RAD21, SMC3, POLR2A, POLR2Ap5 for
human GM12878 cells) and histone modifications (H3K4me2, H3K4me3,
H3K9me3, H3K27ac, H3K27me3, H3K36me3 for both human GM12878 cells and
mouse neonatal neurons) were collected. These data, with different
types of processed files, were collected from ENCODE. Here the fold
change was utilized over the control file, because it provides whole
genome protein binding information. This file was formatted as BigWig
and was transformed to a bedgraph file using the UCSC script
bigWigToBedGraph. Then bedtools intersect were used to find the
intersected regions between the index file and the marker bedgraph
file. Next, the fold change of all 20 kb bins was weight‐averaged and
they were indexed according to the index file.
Genomic Repeat
The repeat data were obtained from RepeatMasker. Owing to the low
resolution of the 3D genome, only broadly distributed repeats would be
statistically significant for enrichment. Therefore, repeats with less
occurrences were filtered out. Only repeats with more than 20 000
occurrences were left. The remaining repeats with the index file were
intersected and then their percentage at a given genomic bin was
calculated, and finally they were indexed in the aforementioned
procedure.
Chromatin State
Chromatin states described the transcriptional states of genomic loci.
They were calculated using HMM and based on chip‐seq data.^[ [182]^37 ,
[183]^38 ^] Because chromatin states were discrete states,
intersections, calculated percentages, and indexed were performed
through processes similar to those of the repeats. Each genomic bin had
a value for each chromatin state, representing the length of the state
within that bin.
CpG
The CpG data were downloaded from ENCODE, formatted as bedMethyl. Sites
were filtered out without reads. The remaining sites were intersected
with the index file, the median CpG percentages were calculated in a
given bin, and finally they were indexed using the aforementioned
procedure.
DamID‐Lmn for Detecting LADs
DamID‐seq identified protein binding sites by expressing the proposed
DNA‐binding protein as a fusion protein with DNA methyltransferase.
Using DamID‐seq to detect the binding sites of Lmn‐B1 (Lamin‐B1), the
main component of nuclear lamina, is a popular method of detecting
LADs. The mESC LAD experimental data were utilized.^[ [184]^39 ^]
Because these data were discrete like genomic repeats, the data were
intersected, calculated, and indexed in the same way as the repeats.
Notably, the bin size was 100 kb for the used mES cells.
To validate DisTP, experimental LAD data were compared to the
probability of appearing at the nuclear periphery (percentages of cells
whose given genomic bin had DisTP below 2). As they were both indexed
by the same index file, they were associated with their index. The
outliers were eliminated using three‐sigma limits.
Noteworthy, due to the small number of available cells (n = 8), the
possible values for simulated LAD data are too sparse at 100 kb
resolution. Therefore, the nearby bins were concatenated to 2 Mb
resolution, in order to provide continuous percentages. The number of 2
Mb genomic bins is 1103. The two‐sided Pearson correlation was computed
using scipy.stats.pearsonr.
Stochasticity of Density and DisTP
To explore the stochasticity of DisTP, its standard deviations were
computed (SDs; Figure [185]S4, Supporting Information). SD was computed
among cells of given genomic bins, thereby reflecting the cell
heterogeneity. The lower SD meant a more stable DisTP value among
cells.
The stochasticity of density was computed similarly to DisTP. In
contrast with DisTP, the lower SD regions of density showed obvious
cell‐type specificity. Therefore, the Pearson correlation between
density mean and density SD (scipy.stats.pearsonr) was conducted, to
identify the values of density that were more stable.
Conflict of Interest
The authors declare no conflict of interest.
Authors Contribution
Y.C. developed the D^2 algorithm with the help of X.Y. Y.C. downloaded,
analyzed, and interpreted the data in current studies. P.J. and T.W.
provided help for analyzing data. Y.C. was the major contributor in
writing the manuscript. T.G. helped with the benchmark. X.Y., D.X., and
K.Y. revised the manuscript. K.Y. supervised the whole program. All
authors read and approved the final manuscript.
Supporting information
Supporting Information
[186]Click here for additional data file.^ (4.6MB, pdf)
Acknowledgements