Graphical abstract
[37]graphic file with name ga1.jpg
[38]Open in a new tab
Keywords: Spatial binding pattern, Cell reprogramming, Pioneer factor,
Multivariate linear regression
Abstract
Understanding the target regulation between pioneer factor and its
binding genes is crucial for improving the efficiency of TF-mediated
reprogramming. Oct4 as the only one factor that cannot be substituted
by other POU members, it is urgent need to develop a quantitative model
for describing the spatial binding pattern with its target genes. The
dynamic profiles of pioneer factor Oct4-binding showed that the major
wave occurs at the intermediate stage of cell reprogramming (from day 7
to day 15), and the promoter is the preferred targeting regions. The
Oct4-binding distributions perform significant chromosome bias. The
overall enrichment on chromosome 1–11 is higher than that on the
others. The dramatic event of TF-mediated reprogramming is mainly
concentrated on autosomes. We also found that the spatial binding
ability of Oct4 binding can be represented quantitatively by using
three parameters of peaks (height, width and distance). The dynamic
changes of Oct4-binding demonstrated that the width play more important
roles in regulating expression of target genes. At last, a multivariate
linear regression was introduced to establish the spatial binding model
of the Oct4-binding. The evaluation results confirmed that the height
and width is positively correlated with the gene expression. And the
additive interaction terms of height and width can better optimize the
model performance than the multiplicative terms. The best average
coefficients of determination of improved model achieved to 81.38%. Our
study will provide new insights into the cooperative regulation of
spatial binding pattern of pioneer factors in cell reprogramming.
1. Introduction
Somatic cell reprogramming is the process of reprogramming the
differentiated somatic cells into pluripotency or even totipotency
under specific induction conditions [39][1]. Since 2006, Yamanaka
successfully induced fibroblasts into induced pluripotent stem cells
(iPSCs) by screening, combination and overexpression of the four
transcription factors of Oct4, Sox2, c-Myc and Klf4, this discovery has
undoubtedly initiated the pioneer of transcriptional factor-mediated
reprogramming and overturned the previous understanding of
“Differentiation process is irreversible” [40][2], [41][3], [42][4],
[43][5]. The generation of iPSCs makes it possible to obtain
patient-specific stem cells without ethical disputes, which has great
potential in tissue and organ transplantation, stem cell therapy,
molecular mechanism of specific diseases and personalized medicine
[44][2], [45][6], [46][7], [47][8], [48][9], [49][10].
With the deepening of research, many studies have shown that Oct4 as
one of the four Yamanaka factors required for reprogramming somatic
cells into iPSCs and that it is the only factor that cannot be
substituted in this process by other members [50][11]. For example,
OCT4 and Sox2 can reprogram human umbilical cord blood stem cells
directly to pluripotency [51][5], but for endogenous neural stem cells
with high expression of Sox2 and c-Myc, only forced expression of OCT4
can generate iPSCs [52][12]. Moreover, OCT4 can recruit other factors
required for regulating the expression of its target genes, such as
chromatin remodeling complexes as an important partner of OCT4
interaction, which is contribute to the reorganization of nucleosome
positioning and is necessary for pluripotency [53][13], [54][14].
Additionally, Oct4 acting as a pioneer transcription factor not only
functions by recognizing and binding to DNA regulatory regions alone or
in cooperation with other TFs [55][15], [56][16], [57][17], [58][18],
but also by identifying DNA in closed chromatin state [59][19].
Similarly, some studies have also revealed that in the early stage of
reprogramming, Oct4 plays a pioneering role in the acquisition and
maintenance of cellular pluripotency by effectively binding to
noncoding regulatory regions of pluripotent regulatory genes, actively
opening up the local chromatin, turning on the genome of downstream
pluripotent regulatory network, and activating the expression of
multiple genes [60][14], [61][20], [62][21]. In the studies of TFs
impacting on transcriptional regulation, there are some quantitative
models have been constructed, in which the gene expression profile is
regarded as the response variable and various features related to Oct4
and other TFs are taken as the explanatory variables. Examples of such
features include characteristics or counts of motifs recognized by the
TFs [63][22], [64][23], [65][24], sum of motif occurrences weighted by
their distances from the target gene [66][25], the weighted sum of the
corresponding ChIP-Seq signal strength of TFs [67][15], [68][26],
[69][27], [70][28] and the number of TF-binding events [71][29]. Oct4
as the only one factor that cannot be substituted by other POU members
in cell reprogramming, but there are few quantitative studies on the
spatial binding pattern of Oct4 to regulate gene expression.
In our study, based on the high-throughput sequencing data generated by
ChIP-seq, a comprehensive genome DNA binding map of Oct4 with different
reprogramming time points was discussed. The chromosome bias of
Oct4-binding distributions was further explored. Next, we delineated
the genome-wide spatial binding pattern of Oct4 during reprogramming
and presented the multivariate regression model of Oct4 binding pattern
impacting gene expression levels. At last, we identified some target
genes to analysis the cooperative regulatory relationship. The results
of this study will be helpful for improving the efficiency of
TF-mediated reprogramming.
2. Materials and methods
2.1. Dataset collection
The ChIP-seq data of this study were downloaded from Gene Expression
Omnibus (GEO) database under accession number [72]GSE67520. The
ChIP-seq data of Oct4 contains nine reprogramming time points from
mouse fibroblasts into iPSCs, including fibroblast status, days 1, 3,
5, 7, 11, 15, 18, induced pluripotent stem cell status (D0, D1, D3, D5,
D7, D11, D15, D18, DIPSC). The microarray transcriptome data was also
downloaded from the GEO database, and GEO accession no. [73]GSE67462,
which includes two biologically repeated gene expression microarray
data of 18 samples at different time points of reprogramming (D0, D1,
D3, D5, D7, D11, D15, D18 and DIPSC) in mice.
2.2. Data processing
The downloaded ChIP-seq original fastq format data were controlled by
FastQC software
([74]http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to
remove low-quality samples. Next Oct4 ChIP-seq reads were aligned to
the mouse reference genome (Mm9 assembly) using Bowtie (version 0.12.7)
[75][30] short read alignment software with parameter setting: -v2-m1
and other default values. In our study only tags uniquely mapped to the
genome were retained. Then Oct4 binding peaks were called by MACS2
(version 2.1.0) [76][31] with the input data as the control. Finally
using R package ChIPseeker [77][32] annotated with the position of the
peaks in the genome, in which −5 kb to +0.5 kb of gene transcription
start sites (TSS) were defined gene promoter, and gene enhancer were
defined as upstream 5 kb to 50 kb of gene TSS. In addition, the
expression value of replicates at different time points were averaged
as the final gene expression value data.
2.3. Integration of Oct4 spatial binding parameters
Since each target gene is not usually corresponding to a unique peak
([78]Fig. S2A), the many-to-single relationship between the peaks and
the gene need to redefine. For each peak–gene pair, we integrated the
binding parameters of the measured Oct4 peaks around each gene promoter
into spatial binding parameters of a single peak, namely H[i], D[i],
W[i] (Height[i], Distance[i], Width[i]) ([79]Fig. 1A). We assumed that
the height of peak binding on gene i was a weighted average of height
values of all of the peaks on gene i:
[MATH: Hi=∑k=1jhki/jk∈j<
/mrow> :MATH]
(1)
where
[MATH: hki :MATH]
is the height (the score of an individual peak binding site) of the kth
peak of Oct4 binding on the gene i, j is a sum of the number of all of
the peaks on gene i. And the distance (the distance from the peak
center to the TSS) and width (the difference value between the endpoint
coordinates and the starting coordinates of peak) of peak binding on
gene i as follow:
[MATH: Wi=maxw∑k=1ji,k<
/mrow> :MATH]
(2)
[MATH: Di=mind∑k=1ji,k<
/mrow> :MATH]
(3)
The d (i, k) is the distance of the kth peak to TSS of gene i, and
[MATH: Di :MATH]
indicates the minimum distance between peak k and the gene i among all
of the peaks on gene i. The w (i, k) indicates the width of the kth
peak binding on the gene i, and
[MATH: Wi :MATH]
is the maximum width of the peak binding on the gene i among all of the
peak width values.
Fig. 1.
[80]Fig. 1
[81]Open in a new tab
Dynamic profiles of Oct4-binding and characteristic of chromosomal
distribution. (A) Flowchart of this study. (B) Dynamic profiles of Oct4
binding in the whole genome during reprogramming. (C) Bar plot of the
distribution of Oct4_Peak on chromosomes at different time points of
programming. The thickness of connecting lines between chromosomes and
time points is proportional of the number of peaks. (D) The number of
Oct4 peaks distributed on different chromosomes during reprogramming.
(E) The boxplot represents the dynamic change of height of peaks during
reprogramming. (F) The boxplot represents the dynamic change of width
of peaks during reprogramming. (G) Kernel Density plot of width of
Oct4-binding on different target genes during reprogramming. (H) The
dynamic change of width of Oct4 peaks during reprogramming.
2.4. Quantitative correlation analysis and multivariate regression
Next, we calculated the Pearson correlation coefficient (PCC) between
the three spatial parameters of Oct4 binding on its target genes and
the gene expression to quantitatively describe the correlation between
the two in the reprogramming process. The PCC value [82][33] was
introduced as follow:
[MATH:
PX,Y=∑i
=1nXi-X
-Yi-Y
-∑i=1n
Xi-X
-2∑i=1nYi-Y
-2
-1≤PX,Y≤1
mfenced> :MATH]
(4)
where X[i] (H[i], D[i], W[i]) is the spatial parameters of Oct4 binding
on gene i and Y[i] is the true expression of gene i. n is a total
number of genes at the same reprogramming time point. The higher PCC
indicates the stronger correlation.
Based on the obtained good quantitative correlations, we described the
expression level of gene i by a linear regression model:
[MATH: Yi^~μi+αi<
/mrow>Hi+βi<
msub>Di+γiWi :MATH]
(5)
where
[MATH: Yi^ :MATH]
is the theoretical expression level of gene i,
[MATH: μi :MATH]
is a constant,
[MATH: Hi :MATH]
is the height value of Oct4 peak for the ith gene and α[i] is the
regression coefficient for the peak height on ith gene. Similarly,
[MATH: Di :MATH]
is the distance value of Oct4 peak for the ith gene and
[MATH: βi :MATH]
is the regression coefficient for the peak distance,
[MATH: Wi :MATH]
is the width of peak for the ith gene and
[MATH: γi :MATH]
also is the regression coefficient.
The coefficients of regression model are estimated by using least
squares:
[MATH: μi,α<
/mi>i,βi,
mo>γi=argminμi,αi,βi
,γi
∑i=1n(Yi-Yi^)2<
/mrow> :MATH]
(6)
So a simple multivariate linear regression model was given by
[MATH: Y^=μ+αH+βD+γW :MATH]
(7)
where
[MATH: Y^ :MATH]
is the theoretical expression level,
[MATH: μ :MATH]
is a constant. And
[MATH: α,β,γ
:MATH]
are the regression coefficients for
[MATH: H,D,W
:MATH]
, respectively. To investigate the cooperative regulation of spatial
binding pattern, we exhaustively searched interaction terms from our
multivariate linear regression model, such as the additive and
multiplicative interaction terms of height and width, the square
interaction term of the distance. The interaction terms of parameters
were added to the linear regression model to get five new models, which
can be formulated as
[MATH: Y^=μ+α<
mo>∗H∗W2
+β∗D+H+γ∗D+W8Y^=μ+α<
mo>∗H+W2
+β∗D-H+γ∗D-W9Y^=μ+α<
mo>∗H∗W+β∗D+H+γ∗D+W10Y^=μ+α<
mo>∗H∗W+β∗D2
+γ∗D+H+η∗D+W11Y^=μ+α<
mo>∗H+W2
+β∗D2+γ∗D-H+η∗D-W12 :MATH]
2.5. KEGG pathway enrichment analysis
Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway-based analysis
helps us to further understand genes different biological functions
[83][34] at different time points of reprogramming. So we performed
KEGG pathway enrichment analysis on the unique and shared genes at four
time points of reprogramming (D11, D15, D18 and DIPSC) with the
Database for Annotation, Visualization and Integrated Discovery (DAVID)
Bioinformatics Resource [84][35] ([85]Fig. 1A). The horizontal axis
indicates time points of programming and vertical ordinates are the
terms of the KEGG pathways. Gene ratio is the proportion of the number
of genes vs. the total number of genes in the same KEGG pathway. The
color represents p value.
2.6. Analysis of gene co-expression and network construction
The gene co-expression networks of Pou5f1, Xrcc5, Fubp1, Tbx3, Aurkb,
Cd109, Col25a1 and Kdm5a the eight genes we screened. The WGCNA R
package was used to establish co-expression networks of these genes
[86][36] ([87]Fig. 1A). Next the network was visualized by
Cytoscape3.7.0 [88][37]. In the network, the nodes represent the genes
and the edge correlates with the regulation capacity for adjacent
genes. The larger the edge, the stronger the regulatory relationship
between the two genes and the more edges of a gene means the more
central role it has within the network.
2.7. Data visualization
In this study, data visualization was carried out mainly by the R,
including the R/Bioconductor ([89]http://www.bioconductor.org). The
heatmap and Upset plot were produced using Pheatmap and UpsetR
[90][38], respectively. The genome browser view was obtained using the
Integrative Genomics Viewer (IGV) [91][39] and the density graph,
boxplot, bubble graph and so on were generated with the R packet
ggplot2 ([92]http://ggplot2.org/).
3. Results
3.1. Dynamic profiles and chromosomal distribution of Oct4-binding in the
whole genome
Firstly, the counts of Oct4-binding peaks in five genomic regions,
5′UTR, 3′UTR, Enhancer, Promoter and GeneBody ([93]Fig. 1B) were
selected to describe dynamic profiles of Oct4-binding during
reprogramming. And we proposed the peaks per kilobase mapped peaks
(PPKM,
[MATH: PPKM=∑Peak
CountThelengthofregion*1000
:MATH]
) to reduce the effects of region length on the statistical results.
We found that the major wave occurs at the intermediate stage of cell
reprogramming (from day 7 to day 15). Especially in promoter regions,
all of the PPKM reached more than 20000, showing that the promoter is
the preferential targeting genomics regions. In addition, UTR and
GeneBody regions also had high proportion of Oct4-binding ([94]Fig.
1B). During the reprogramming, the quantity change of Oct4 targeted
genes was consistent with the peaks and reached the most at day 15
([95]Fig. S1A and B). Interestingly, at the initial stage of
reprogramming (D1-D3), the number of the genes and peaks also exhibited
a high level, indicating that Oct4 not only has a targeted regulatory
effect on pluripotent genes, but also plays a certain role in some
housekeeping genes [96][40]. We further explored the number
relationship between the peaks and genes ([97]Fig. S2A). In the process
of reprogramming (expect D1), each target gene corresponded to one or
two peaks in all regions, and the one-to-one relationship was more
obvious in promoter regions. Thus, it can be concluded that there is
not a unique peak on the same regulatory element of the same target
gene.
For the distributions of Oct4-binding on different chromosomes, the
results showed that Oct4 was more inclined to bind on autosomes,
especially on chromosome 1, 2, 7 and 11 (Chr1, Chr2, Chr7 and Chr11)
were the most significantly ([98]Fig. 1C and S1C). The specific number
of peak distributions on different chromosomes was marked in detail
([99]Fig. 1D). The number of Oct4 peaks distributed on all autosomal
chromosomes was more than 1000 throughout the reprogramming process.
And at the days 7–18 Oct4 tended to bind on Chr1-Chr11 and Chr17.
Surprisingly, the number of peaks enriched on chromosome 1, 2, and 11
(Chr1, Chr2, and Chr11) was more than 4000 (except DIPSC) and the
ontologies for the genes represented by these peaks on these three
chromosomes were been shown in [100]Supplementary Table 1. However,
there are a few peaks mapped to sex chromosomes, especially on the Y
chromosome (ChrY) with almost no Oct4 binding. It indicated the
dramatic event of TF-mediated reprogramming is concentrated on
autosomes.
In addition, we selected the Oct4 peaks on day 3 of the early
reprogramming, day 7 (the number of peaks significantly increased), and
day 15 (the number of peaks reached the maximum), respectively to
reveal the dynamic pattern of Oct4-binding. The results showed the
median of peak height on all chromosomes (except ChrY) was around 400,
and the change was not obvious ([101]Fig. S1D). The significant change
on ChrY was related to the low number of peaks distributed on this
chromosome ([102]Fig. 1D). The median of distance on D15 was lower than
other two days, and the overall levels on D3 were the highest. However,
the width of peaks was more than 400 in all chromosomes (except ChrY)
on D15. Surprisingly, the height and width of peaks binding on Chr11,
Chr13 and Chr17 showed evidently increased.
The dynamic changes of Oct4 spatial parameters in the whole genome as
shown in [103]Fig. 1E-1H, the height of peaks was generally 300–600 at
different time points of reprogramming (except D0 and DIPSC), and the
overall trend was not obvious ([104]Fig. 1E). The width was 200–400 bp
on days 1–5 (D1-D5) and days 18-IPSC (D18-DIPSC), but on days 7 and 11
the width gradually increased to 400–800 bp ([105]Fig. 1F and 1G).
Generally, the width of peaks tended to widen first and then narrow
during the reprogramming ([106]Fig. 1H). The above results demonstrated
that the height of Oct4 combined with its target genes does not change
significantly, but the width tended to widen, implying the width may
play a vital role in promoting the reprogramming.
3.2. Dynamics of Oct4 binding in promoter regions during reprogramming
The dynamic profiles of Oct4-binding showed the promoter is the main
targeting genomics regions. So we further illustrated the
characteristic of Oct4 binding in promoter regions. Throughout the
reprogramming process, the three spatial binding parameters of the
peaks binding on each target gene promoter are processed, respectively
(as shown in Methods).
We found that Oct4 extensively bound to the promoter from day 7 until
day 18 (D7-D18) and the number of target genes was prominent increased
at these days ([107]Fig. 2A), implying that the targeted regulation of
Oct4 is dynamic with time preference, as in the previous study
[108][40]. As shown in [109]Fig. 2B, the peaks of Oct4 were mainly
located within 500 bp (−500, 500) of the TSS and relatively few located
beyond upstream 2 kb of TSS (−3000, −2000) in the reprogramming
process. Intriguingly, there was a tendency for Oct4 to bind to
upstream 500 bp regions from TSS at the early stage of reprogramming
(D3–D7) (blue box of [110]Fig. 2B), but at the end stage of
reprogramming (D11–D18) Oct4 mainly enriched in downstream 500 bp
regions from the TSS (pink box of [111]Fig. 2B). The results suggest
that Oct4 first acts on the TSS upstream and then targeted binding on
TSS downstream of target genes to facilitate the reprogramming process.
Next, we questioned how Oct4 binding in promoter regions regulates gene
expression. The quantitative analysis of Oct4 spatial binding
parameters and its target gene expression levels during reprogramming
were shown in [112]Fig. 2C. The height of Oct4 peaks binding on its
target genes remained relatively steady and there was a slight increase
on day 15 (D15), but the distance of peaks tended to decrease from day
1 (D1) until day 18 (D18), and the overall changes were not obvious.
Surprisingly, the width of peaks increased significantly from day 7
until day 18 (D7–D18) coincided with the increase of gene expression
levels. Therefore, it can be inferred that during the reprogramming
process, the spatial binding pattern of Oct4 plays a pioneering role in
activating the expression of its target genes, which further proves
that the height and width of Oct4 peaks may facilitate the gene
expression, and distance may be a negative factor in this process.
Fig. 2.
[113]Fig. 2
[114]Open in a new tab
The dynamics binding of Oct4 in promoter regions. (A) The number of
Oct4 target genes at different reprogramming time points is represented
as bar graph, and the specific number is marked at the top of the bar
graph. (B) Distribution of Oct4-binding sites with respect to the TSS
in promoter at different reprogramming time points is shown. (C)
Boxplots of height (upper left), width (left bottom) and distance
values of Oct4 peaks (upper right) located in target genes promoter and
its target genes expression in promoter (right bottom). (D) Genome
browser view at the Erbb3, Sall1, Lama3 and Lin28a locus of Oct4
binding at the different time points of reprogramming, respectively.
And the blue boxes represent the promoters of these genes and their
surrounding regions. (For interpretation of the references to color in