Graphical abstract [37]graphic file with name ga1.jpg [38]Open in a new tab Keywords: Spatial binding pattern, Cell reprogramming, Pioneer factor, Multivariate linear regression Abstract Understanding the target regulation between pioneer factor and its binding genes is crucial for improving the efficiency of TF-mediated reprogramming. Oct4 as the only one factor that cannot be substituted by other POU members, it is urgent need to develop a quantitative model for describing the spatial binding pattern with its target genes. The dynamic profiles of pioneer factor Oct4-binding showed that the major wave occurs at the intermediate stage of cell reprogramming (from day 7 to day 15), and the promoter is the preferred targeting regions. The Oct4-binding distributions perform significant chromosome bias. The overall enrichment on chromosome 1–11 is higher than that on the others. The dramatic event of TF-mediated reprogramming is mainly concentrated on autosomes. We also found that the spatial binding ability of Oct4 binding can be represented quantitatively by using three parameters of peaks (height, width and distance). The dynamic changes of Oct4-binding demonstrated that the width play more important roles in regulating expression of target genes. At last, a multivariate linear regression was introduced to establish the spatial binding model of the Oct4-binding. The evaluation results confirmed that the height and width is positively correlated with the gene expression. And the additive interaction terms of height and width can better optimize the model performance than the multiplicative terms. The best average coefficients of determination of improved model achieved to 81.38%. Our study will provide new insights into the cooperative regulation of spatial binding pattern of pioneer factors in cell reprogramming. 1. Introduction Somatic cell reprogramming is the process of reprogramming the differentiated somatic cells into pluripotency or even totipotency under specific induction conditions [39][1]. Since 2006, Yamanaka successfully induced fibroblasts into induced pluripotent stem cells (iPSCs) by screening, combination and overexpression of the four transcription factors of Oct4, Sox2, c-Myc and Klf4, this discovery has undoubtedly initiated the pioneer of transcriptional factor-mediated reprogramming and overturned the previous understanding of “Differentiation process is irreversible” [40][2], [41][3], [42][4], [43][5]. The generation of iPSCs makes it possible to obtain patient-specific stem cells without ethical disputes, which has great potential in tissue and organ transplantation, stem cell therapy, molecular mechanism of specific diseases and personalized medicine [44][2], [45][6], [46][7], [47][8], [48][9], [49][10]. With the deepening of research, many studies have shown that Oct4 as one of the four Yamanaka factors required for reprogramming somatic cells into iPSCs and that it is the only factor that cannot be substituted in this process by other members [50][11]. For example, OCT4 and Sox2 can reprogram human umbilical cord blood stem cells directly to pluripotency [51][5], but for endogenous neural stem cells with high expression of Sox2 and c-Myc, only forced expression of OCT4 can generate iPSCs [52][12]. Moreover, OCT4 can recruit other factors required for regulating the expression of its target genes, such as chromatin remodeling complexes as an important partner of OCT4 interaction, which is contribute to the reorganization of nucleosome positioning and is necessary for pluripotency [53][13], [54][14]. Additionally, Oct4 acting as a pioneer transcription factor not only functions by recognizing and binding to DNA regulatory regions alone or in cooperation with other TFs [55][15], [56][16], [57][17], [58][18], but also by identifying DNA in closed chromatin state [59][19]. Similarly, some studies have also revealed that in the early stage of reprogramming, Oct4 plays a pioneering role in the acquisition and maintenance of cellular pluripotency by effectively binding to noncoding regulatory regions of pluripotent regulatory genes, actively opening up the local chromatin, turning on the genome of downstream pluripotent regulatory network, and activating the expression of multiple genes [60][14], [61][20], [62][21]. In the studies of TFs impacting on transcriptional regulation, there are some quantitative models have been constructed, in which the gene expression profile is regarded as the response variable and various features related to Oct4 and other TFs are taken as the explanatory variables. Examples of such features include characteristics or counts of motifs recognized by the TFs [63][22], [64][23], [65][24], sum of motif occurrences weighted by their distances from the target gene [66][25], the weighted sum of the corresponding ChIP-Seq signal strength of TFs [67][15], [68][26], [69][27], [70][28] and the number of TF-binding events [71][29]. Oct4 as the only one factor that cannot be substituted by other POU members in cell reprogramming, but there are few quantitative studies on the spatial binding pattern of Oct4 to regulate gene expression. In our study, based on the high-throughput sequencing data generated by ChIP-seq, a comprehensive genome DNA binding map of Oct4 with different reprogramming time points was discussed. The chromosome bias of Oct4-binding distributions was further explored. Next, we delineated the genome-wide spatial binding pattern of Oct4 during reprogramming and presented the multivariate regression model of Oct4 binding pattern impacting gene expression levels. At last, we identified some target genes to analysis the cooperative regulatory relationship. The results of this study will be helpful for improving the efficiency of TF-mediated reprogramming. 2. Materials and methods 2.1. Dataset collection The ChIP-seq data of this study were downloaded from Gene Expression Omnibus (GEO) database under accession number [72]GSE67520. The ChIP-seq data of Oct4 contains nine reprogramming time points from mouse fibroblasts into iPSCs, including fibroblast status, days 1, 3, 5, 7, 11, 15, 18, induced pluripotent stem cell status (D0, D1, D3, D5, D7, D11, D15, D18, DIPSC). The microarray transcriptome data was also downloaded from the GEO database, and GEO accession no. [73]GSE67462, which includes two biologically repeated gene expression microarray data of 18 samples at different time points of reprogramming (D0, D1, D3, D5, D7, D11, D15, D18 and DIPSC) in mice. 2.2. Data processing The downloaded ChIP-seq original fastq format data were controlled by FastQC software ([74]http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to remove low-quality samples. Next Oct4 ChIP-seq reads were aligned to the mouse reference genome (Mm9 assembly) using Bowtie (version 0.12.7) [75][30] short read alignment software with parameter setting: -v2-m1 and other default values. In our study only tags uniquely mapped to the genome were retained. Then Oct4 binding peaks were called by MACS2 (version 2.1.0) [76][31] with the input data as the control. Finally using R package ChIPseeker [77][32] annotated with the position of the peaks in the genome, in which −5 kb to +0.5 kb of gene transcription start sites (TSS) were defined gene promoter, and gene enhancer were defined as upstream 5 kb to 50 kb of gene TSS. In addition, the expression value of replicates at different time points were averaged as the final gene expression value data. 2.3. Integration of Oct4 spatial binding parameters Since each target gene is not usually corresponding to a unique peak ([78]Fig. S2A), the many-to-single relationship between the peaks and the gene need to redefine. For each peak–gene pair, we integrated the binding parameters of the measured Oct4 peaks around each gene promoter into spatial binding parameters of a single peak, namely H[i], D[i], W[i] (Height[i], Distance[i], Width[i]) ([79]Fig. 1A). We assumed that the height of peak binding on gene i was a weighted average of height values of all of the peaks on gene i: [MATH: Hi=k=1jhki/jkj< /mrow> :MATH] (1) where [MATH: hki :MATH] is the height (the score of an individual peak binding site) of the kth peak of Oct4 binding on the gene i, j is a sum of the number of all of the peaks on gene i. And the distance (the distance from the peak center to the TSS) and width (the difference value between the endpoint coordinates and the starting coordinates of peak) of peak binding on gene i as follow: [MATH: Wi=maxwk=1ji,k< /mrow> :MATH] (2) [MATH: Di=mindk=1ji,k< /mrow> :MATH] (3) The d (i, k) is the distance of the kth peak to TSS of gene i, and [MATH: Di :MATH] indicates the minimum distance between peak k and the gene i among all of the peaks on gene i. The w (i, k) indicates the width of the kth peak binding on the gene i, and [MATH: Wi :MATH] is the maximum width of the peak binding on the gene i among all of the peak width values. Fig. 1. [80]Fig. 1 [81]Open in a new tab Dynamic profiles of Oct4-binding and characteristic of chromosomal distribution. (A) Flowchart of this study. (B) Dynamic profiles of Oct4 binding in the whole genome during reprogramming. (C) Bar plot of the distribution of Oct4_Peak on chromosomes at different time points of programming. The thickness of connecting lines between chromosomes and time points is proportional of the number of peaks. (D) The number of Oct4 peaks distributed on different chromosomes during reprogramming. (E) The boxplot represents the dynamic change of height of peaks during reprogramming. (F) The boxplot represents the dynamic change of width of peaks during reprogramming. (G) Kernel Density plot of width of Oct4-binding on different target genes during reprogramming. (H) The dynamic change of width of Oct4 peaks during reprogramming. 2.4. Quantitative correlation analysis and multivariate regression Next, we calculated the Pearson correlation coefficient (PCC) between the three spatial parameters of Oct4 binding on its target genes and the gene expression to quantitatively describe the correlation between the two in the reprogramming process. The PCC value [82][33] was introduced as follow: [MATH: PX,Y=i =1nXi-X -Yi-Y -i=1n Xi-X -2i=1nYi-Y -2 -1PX,Y1 :MATH] (4) where X[i] (H[i], D[i], W[i]) is the spatial parameters of Oct4 binding on gene i and Y[i] is the true expression of gene i. n is a total number of genes at the same reprogramming time point. The higher PCC indicates the stronger correlation. Based on the obtained good quantitative correlations, we described the expression level of gene i by a linear regression model: [MATH: Yi^~μi+αi< /mrow>Hi+βi< msub>Di+γiWi :MATH] (5) where [MATH: Yi^ :MATH] is the theoretical expression level of gene i, [MATH: μi :MATH] is a constant, [MATH: Hi :MATH] is the height value of Oct4 peak for the ith gene and α[i] is the regression coefficient for the peak height on ith gene. Similarly, [MATH: Di :MATH] is the distance value of Oct4 peak for the ith gene and [MATH: βi :MATH] is the regression coefficient for the peak distance, [MATH: Wi :MATH] is the width of peak for the ith gene and [MATH: γi :MATH] also is the regression coefficient. The coefficients of regression model are estimated by using least squares: [MATH: μi,α< /mi>i,βi,γi=argminμi,αi,βi ,γi i=1n(Yi-Yi^)2< /mrow> :MATH] (6) So a simple multivariate linear regression model was given by [MATH: Y^=μ+αH+βD+γW :MATH] (7) where [MATH: Y^ :MATH] is the theoretical expression level, [MATH: μ :MATH] is a constant. And [MATH: α,β,γ :MATH] are the regression coefficients for [MATH: H,D,W :MATH] , respectively. To investigate the cooperative regulation of spatial binding pattern, we exhaustively searched interaction terms from our multivariate linear regression model, such as the additive and multiplicative interaction terms of height and width, the square interaction term of the distance. The interaction terms of parameters were added to the linear regression model to get five new models, which can be formulated as [MATH: Y^=μ+α< mo>∗HW2 +βD+H+γD+W8Y^=μ+α< mo>∗H+W2 +βD-H+γD-W9Y^=μ+α< mo>∗HW+βD+H+γD+W10Y^=μ+α< mo>∗HW+βD2 +γD+H+ηD+W11Y^=μ+α< mo>∗H+W2 +βD2+γD-H+ηD-W12 :MATH] 2.5. KEGG pathway enrichment analysis Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway-based analysis helps us to further understand genes different biological functions [83][34] at different time points of reprogramming. So we performed KEGG pathway enrichment analysis on the unique and shared genes at four time points of reprogramming (D11, D15, D18 and DIPSC) with the Database for Annotation, Visualization and Integrated Discovery (DAVID) Bioinformatics Resource [84][35] ([85]Fig. 1A). The horizontal axis indicates time points of programming and vertical ordinates are the terms of the KEGG pathways. Gene ratio is the proportion of the number of genes vs. the total number of genes in the same KEGG pathway. The color represents p value. 2.6. Analysis of gene co-expression and network construction The gene co-expression networks of Pou5f1, Xrcc5, Fubp1, Tbx3, Aurkb, Cd109, Col25a1 and Kdm5a the eight genes we screened. The WGCNA R package was used to establish co-expression networks of these genes [86][36] ([87]Fig. 1A). Next the network was visualized by Cytoscape3.7.0 [88][37]. In the network, the nodes represent the genes and the edge correlates with the regulation capacity for adjacent genes. The larger the edge, the stronger the regulatory relationship between the two genes and the more edges of a gene means the more central role it has within the network. 2.7. Data visualization In this study, data visualization was carried out mainly by the R, including the R/Bioconductor ([89]http://www.bioconductor.org). The heatmap and Upset plot were produced using Pheatmap and UpsetR [90][38], respectively. The genome browser view was obtained using the Integrative Genomics Viewer (IGV) [91][39] and the density graph, boxplot, bubble graph and so on were generated with the R packet ggplot2 ([92]http://ggplot2.org/). 3. Results 3.1. Dynamic profiles and chromosomal distribution of Oct4-binding in the whole genome Firstly, the counts of Oct4-binding peaks in five genomic regions, 5′UTR, 3′UTR, Enhancer, Promoter and GeneBody ([93]Fig. 1B) were selected to describe dynamic profiles of Oct4-binding during reprogramming. And we proposed the peaks per kilobase mapped peaks (PPKM, [MATH: PPKM=Peak CountThelengthofregion*1000 :MATH] ) to reduce the effects of region length on the statistical results. We found that the major wave occurs at the intermediate stage of cell reprogramming (from day 7 to day 15). Especially in promoter regions, all of the PPKM reached more than 20000, showing that the promoter is the preferential targeting genomics regions. In addition, UTR and GeneBody regions also had high proportion of Oct4-binding ([94]Fig. 1B). During the reprogramming, the quantity change of Oct4 targeted genes was consistent with the peaks and reached the most at day 15 ([95]Fig. S1A and B). Interestingly, at the initial stage of reprogramming (D1-D3), the number of the genes and peaks also exhibited a high level, indicating that Oct4 not only has a targeted regulatory effect on pluripotent genes, but also plays a certain role in some housekeeping genes [96][40]. We further explored the number relationship between the peaks and genes ([97]Fig. S2A). In the process of reprogramming (expect D1), each target gene corresponded to one or two peaks in all regions, and the one-to-one relationship was more obvious in promoter regions. Thus, it can be concluded that there is not a unique peak on the same regulatory element of the same target gene. For the distributions of Oct4-binding on different chromosomes, the results showed that Oct4 was more inclined to bind on autosomes, especially on chromosome 1, 2, 7 and 11 (Chr1, Chr2, Chr7 and Chr11) were the most significantly ([98]Fig. 1C and S1C). The specific number of peak distributions on different chromosomes was marked in detail ([99]Fig. 1D). The number of Oct4 peaks distributed on all autosomal chromosomes was more than 1000 throughout the reprogramming process. And at the days 7–18 Oct4 tended to bind on Chr1-Chr11 and Chr17. Surprisingly, the number of peaks enriched on chromosome 1, 2, and 11 (Chr1, Chr2, and Chr11) was more than 4000 (except DIPSC) and the ontologies for the genes represented by these peaks on these three chromosomes were been shown in [100]Supplementary Table 1. However, there are a few peaks mapped to sex chromosomes, especially on the Y chromosome (ChrY) with almost no Oct4 binding. It indicated the dramatic event of TF-mediated reprogramming is concentrated on autosomes. In addition, we selected the Oct4 peaks on day 3 of the early reprogramming, day 7 (the number of peaks significantly increased), and day 15 (the number of peaks reached the maximum), respectively to reveal the dynamic pattern of Oct4-binding. The results showed the median of peak height on all chromosomes (except ChrY) was around 400, and the change was not obvious ([101]Fig. S1D). The significant change on ChrY was related to the low number of peaks distributed on this chromosome ([102]Fig. 1D). The median of distance on D15 was lower than other two days, and the overall levels on D3 were the highest. However, the width of peaks was more than 400 in all chromosomes (except ChrY) on D15. Surprisingly, the height and width of peaks binding on Chr11, Chr13 and Chr17 showed evidently increased. The dynamic changes of Oct4 spatial parameters in the whole genome as shown in [103]Fig. 1E-1H, the height of peaks was generally 300–600 at different time points of reprogramming (except D0 and DIPSC), and the overall trend was not obvious ([104]Fig. 1E). The width was 200–400 bp on days 1–5 (D1-D5) and days 18-IPSC (D18-DIPSC), but on days 7 and 11 the width gradually increased to 400–800 bp ([105]Fig. 1F and 1G). Generally, the width of peaks tended to widen first and then narrow during the reprogramming ([106]Fig. 1H). The above results demonstrated that the height of Oct4 combined with its target genes does not change significantly, but the width tended to widen, implying the width may play a vital role in promoting the reprogramming. 3.2. Dynamics of Oct4 binding in promoter regions during reprogramming The dynamic profiles of Oct4-binding showed the promoter is the main targeting genomics regions. So we further illustrated the characteristic of Oct4 binding in promoter regions. Throughout the reprogramming process, the three spatial binding parameters of the peaks binding on each target gene promoter are processed, respectively (as shown in Methods). We found that Oct4 extensively bound to the promoter from day 7 until day 18 (D7-D18) and the number of target genes was prominent increased at these days ([107]Fig. 2A), implying that the targeted regulation of Oct4 is dynamic with time preference, as in the previous study [108][40]. As shown in [109]Fig. 2B, the peaks of Oct4 were mainly located within 500 bp (−500, 500) of the TSS and relatively few located beyond upstream 2 kb of TSS (−3000, −2000) in the reprogramming process. Intriguingly, there was a tendency for Oct4 to bind to upstream 500 bp regions from TSS at the early stage of reprogramming (D3–D7) (blue box of [110]Fig. 2B), but at the end stage of reprogramming (D11–D18) Oct4 mainly enriched in downstream 500 bp regions from the TSS (pink box of [111]Fig. 2B). The results suggest that Oct4 first acts on the TSS upstream and then targeted binding on TSS downstream of target genes to facilitate the reprogramming process. Next, we questioned how Oct4 binding in promoter regions regulates gene expression. The quantitative analysis of Oct4 spatial binding parameters and its target gene expression levels during reprogramming were shown in [112]Fig. 2C. The height of Oct4 peaks binding on its target genes remained relatively steady and there was a slight increase on day 15 (D15), but the distance of peaks tended to decrease from day 1 (D1) until day 18 (D18), and the overall changes were not obvious. Surprisingly, the width of peaks increased significantly from day 7 until day 18 (D7–D18) coincided with the increase of gene expression levels. Therefore, it can be inferred that during the reprogramming process, the spatial binding pattern of Oct4 plays a pioneering role in activating the expression of its target genes, which further proves that the height and width of Oct4 peaks may facilitate the gene expression, and distance may be a negative factor in this process. Fig. 2. [113]Fig. 2 [114]Open in a new tab The dynamics binding of Oct4 in promoter regions. (A) The number of Oct4 target genes at different reprogramming time points is represented as bar graph, and the specific number is marked at the top of the bar graph. (B) Distribution of Oct4-binding sites with respect to the TSS in promoter at different reprogramming time points is shown. (C) Boxplots of height (upper left), width (left bottom) and distance values of Oct4 peaks (upper right) located in target genes promoter and its target genes expression in promoter (right bottom). (D) Genome browser view at the Erbb3, Sall1, Lama3 and Lin28a locus of Oct4 binding at the different time points of reprogramming, respectively. And the blue boxes represent the promoters of these genes and their surrounding regions. (For interpretation of the references to color in