Abstract Background Cell lines are an indispensable tool in biomedical research and often used as surrogates for tissues. Although there are recognized important cellular and transcriptomic differences between cell lines and tissues, a systematic overview of the differences between the regulatory processes of a cell line and those of its tissue of origin has not been conducted. The RNA-Seq data generated by the GTEx project is the first available data resource in which it is possible to perform a large-scale transcriptional and regulatory network analysis comparing cell lines with their tissues of origin. Results We compared 127 paired Epstein-Barr virus transformed lymphoblastoid cell lines (LCLs) and whole blood samples, and 244 paired primary fibroblast cell lines and skin samples. While gene expression analysis confirms that these cell lines carry the expression signatures of their primary tissues, albeit at reduced levels, network analysis indicates that expression changes are the cumulative result of many previously unreported alterations in transcription factor (TF) regulation. More specifically, cell cycle genes are over-expressed in cell lines compared to primary tissues, and this alteration in expression is a result of less repressive TF targeting. We confirmed these regulatory changes for four TFs, including SMAD5, using independent ChIP-seq data from ENCODE. Conclusions Our results provide novel insights into the regulatory mechanisms controlling the expression differences between cell lines and tissues. The strong changes in TF regulation that we observe suggest that network changes, in addition to transcriptional levels, should be considered when using cell lines as models for tissues. Electronic supplementary material The online version of this article (10.1186/s12864-017-4111-x) contains supplementary material, which is available to authorized users. Keywords: Regulatory networks, Transcriptome, GTEx, Lymphoblastoid cell lines, Fibroblast cell lines Background Cell lines are an essential tool in cellular and molecular biology, providing a lasting resource that can match a particular genotype and phenotype in a controllable and reproducible setting. Cell lines have accelerated the investigation of many biological processes, however despite their merits as an experimental system, cell lines do not capture tissue complexity and heterogeneity, mainly because they consist of a single cell type that is adapted to grow in culture and lacks interactions with other cell types, the extracellular matrix, or paracrine signaling [[43]1, [44]2]. Cellular heterogeneity is present even in seemingly homogenous groups of cells [[45]3]. While it has been previously reported that these and other factors can influence cell line gene expression [[46]4–[47]6], the differences in transcription factor (TF) regulation between cell lines and tissues have not been systematically studied. Regulatory network approaches can help elucidate the regulatory processes associated with the differences in expression observed in cell lines when compared to their tissues of origin. Standard transcriptomic analyses typically focus on studying the regulation and function of one or a few genes, and these approaches fail to characterize the complex cellular processes defined by the collective contribution of signaling pathways and cell-type specific regulators. On the other hand, TF regulatory networks provide an intuitive framework for characterizing the combinatorial regulatory effect of TFs on their target genes. These regulatory networks capture and quantitatively model the processes that drive cellular phenotype, with differences in network structure reflecting changes in regulatory processes. For example, in previous work experimentally interrogating the subnetwork around a few TFs it has been possible to uncover patterns of transcriptional regulation associated with cellular differentiation [[48]7], pluripotency [[49]8], and development [[50]9]. More recently, by integrating different types of genomic data it has been possible to model genome-wide regulatory networks [[51]10] and to identify distinct regulation patterns within different cell types [[52]11] or different disease states [[53]12–[54]14]. Many of these network algorithms rely on a large number of expression samples and, until now, regulatory networks have not been used to elucidate the regulatory process differences between cell lines and their tissues of origin mainly because of the lack of large data sets with paired samples. The Genotype-Tissue Expression (GTEx) project [[55]15] generated a large multi-subject data set that offers an unprecedented opportunity to understand how well a cell line’s regulatory processes recapitulate those of its tissue of origin. GTEx version 6.0 includes RNA-Seq data for 244 paired primary fibroblast cell lines and skin samples and 127 paired Epstein-Barr virus (EBV) transformed lymphoblastoid cell lines (LCLs) and whole blood samples. Primary fibroblasts are a type of finite cell line widely used as model systems because they are easily isolated and grown in culture, and almost never show genetic alterations in oncogenes or tumor suppressors [[56]16, [57]17]. LCLs are among the most widely created, archived, and analyzed continuous cell lines which, in contrast to finite cell lines, acquire the ability to proliferate indefinitely. LCLs have been extensively genotyped and sequenced as part of large collaborative projects, such as the International HapMap [[58]18], 1000 Genomes [[59]19], ENCODE [[60]20] and GTEx [[61]15] projects. Despite their widespread use, there has been concern about using LCLs to model primary tissues, with two small-scale studies finding differences in gene expression profiles between LCLs and primary B cells [[62]21, [63]22]. While these studies found genes differentially expressed in LCLs compared to B cells, for example the over-expression of cell cycle genes, the regulatory mechanisms associated with this differential expression are not known. Here we performed a detailed investigation of gene expression and gene regulatory networks using two cell line and tissue pairs, LCL-vs-blood and fibroblast-vs-skin, to understand the regulatory networks mediating expression differences between the cell lines and their tissues of origin. Although we find that many pathways are preserved between cell lines and their tissues of origin, some biological processes that help define the function of the primary tissue are enriched for genes expressed at lower levels. In addition, we find that LCLs and fibroblast cell lines exhibit large changes in their patterns of TF regulation. For example, while cell cycle genes are over-expressed in cell lines compared to their tissues of origin, they have an overall decrease in negative regulation by TFs that are known to function as repressors. These findings suggest that changes in network properties are useful for understanding alterations in gene regulation between cell lines and their tissues of origin. Results Pathways differentially expressed between cell lines and their tissues of origin The GTEx project collected post mortem biopsies from multiple tissues and created LCLs and fibroblast cell lines. For the analyses described here, we used only data from research subjects for whom primary tissue and matching cell lines were available. Data (version 6.0) were available for 127 paired whole blood samples and LCLs, and for 244 paired full-thickness skin biopsies and primary fibroblast cell lines [[64]15]; 89 subjects have data across all four groups. We did not find any clear separation of samples based on the year of analysis by the GTEx project (Additional file [65]1). The cell lines and tissues express similar numbers of genes mapped to similar functional categories (protein coding, antisense, pseudogene, lincRNA, and other; Additional file [66]1). Principal component analysis (PCA) showed that gene expression easily distinguishes the four groups (Additional file [67]1). The first principal component and the majority of the variability (37%) separated blood and LCLs from the skin and fibroblast samples. The second component (22%) separated tissues from cell lines. This indicates that while the samples separate based on their tissues of origin, there is also a significant separation between cell lines and primary tissues. The separation seen in the PCA remains robust when random samples are selected (Additional file [68]1). In order to quantify the variability present within each of these four groups of samples, we analyzed gene expression variability across all cell line and tissue groups and observed wider variability in gene expression within tissue samples compared to cell line samples (Additional file [69]2). We also used an f-test to evaluate the differences in gene expression variance between the groups. We found a higher percentage of genes with significantly greater variance in blood compared to LCL, and in skin compared to fibroblast (FDR < 0.05, Additional file [70]2). Also, fibroblast have a higher percentage of genes with greater variance than LCL, and blood have a higher percentage than skin. Although some gene expression variability can be attributed to stochastic processes, and tissue heterogeneity may contribute to these observed differences, expression variability is also strongly mediated by the genomic and epigenomic context. For instance, expression variability can be modulated in a tissue-specific fashion and determined by promoter binding affinity [[71]23]. To explore gene regulation differences between cell lines and their tissues of origin, we sought to identify both differentially expressed genes and the pathways associated with them, followed by an analysis of the drivers of the transcriptional changes through gene regulatory network analysis. We used voom [[72]24] and Gene Set Enrichment Analysis (GSEA) [[73]25] to identify biological pathways that are enriched in genes differentially expressed between cell lines and their tissues of origin. We found 8617 genes (32%) to be differentially expressed between LCL and blood samples (absolute log[2] fold change >2 and FDR < 0.05) with most of the differentially expressed genes (71%) over-expressed in LCL samples (Fig. [74]1a, Additional files [75]3 and [76]4). For the fibroblast-vs-skin comparison, we identified 5655 differentially expressed genes (21%). In contrast to the LCL-vs-blood comparison, most of the differentially expressed genes (68%) had increased expression in the primary tissue rather than in the cell line. Fig. 1. Fig. 1 [77]Open in a new tab Pathways are differentially expressed between cell lines and their tissues of origin. a Number of differentially expressed genes (absolute log[2] fold change >2 and FDR < 0.05) using voom on paired samples. b Results of GSEA reported based on the log[10](FDR) significance scale, with one group in red and the other one in blue. The 15 pathways most significantly differentially expressed between each cell line and its tissue of origin. c Pathways enriched for at least two group comparisons (FDR < 0.05). The pathways differentially expressed between the tissues that are also differentially expressed between the cell lines (preserved pathways) are highlighted in red and blue. Pathways over-expressed in both cell lines compared to their tissues of origin are highlighted in yellow. Rows are ordered by hierarchical clustering of the enrichment significance values, log[10](FDR). To represent the FDR significance in the heatmap, the color was saturated at 1.1 × 10^−4. The exact reported FDR can be found in Additional file [78]2 Using GSEA, with genes ranked by the moderated t-statistic from voom, we identified Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [[79]26] enriched for differentially expressed genes between cell lines and their tissues of origin. Consistent with the separation observed in the PCA, both cell lines exhibit enrichment for pathways with similar biological functions compared to their tissues of origin (Fig. [80]1b). While immune processes are down-regulated in cell lines, the pathways with positive enrichment are generally associated with cellular growth, and include cell cycle, DNA replication and repair, and transcription processes. When comparing blood to LCLs, we found that pathways enriched in blood were related to immune system function, including complement and coagulation cascades, hematopoietic cell lineage, chemokine signaling, and natural killer cell mediated cytotoxicity, while the pathways enriched in LCLs were associated with cell growth and death, DNA replication and repair, transcription, and metabolism (FDR < 0.05, Fig. [81]1b). Similarly, when comparing skin to fibroblasts, the pathways enriched in skin were related to the immune system, metabolism, cell adhesion, and melanogenesis, while the pathways enriched in fibroblasts were associated with cell growth and death, DNA replication and repair, transcription, and protein degradation (FDR < 0.05, Fig. [82]1b). The significance of all KEGG pathways is listed in Additional file [83]5. We also performed differential expression and KEGG pathway enrichment analysis comparing the two tissues and comparing the two cell lines (Fig. [84]1c). We found a number of immune signaling pathways enriched in blood compared to skin and in LCLs compared to fibroblasts. For example, pathways related to the biological function of B cells (B cell receptor signaling, toll-like receptor signaling, antigen processing and presentation) were enriched in LCL and blood samples when comparing them to fibroblast and skin samples, respectively. However, some immune related pathways, including chemokine signaling and natural killer cell mediated cytotoxicity, were also enriched in LCL and blood samples compared to fibroblast and skin samples, but they were expressed at lower levels in LCL compared to blood samples. As observed previously, these immune signaling pathways are also significantly depleted in fibroblasts compared to skin, which is a tissue with a key role in immunity and associated with many immune cell types [[85]27]. The pathways enriched in fibroblast and skin samples compared to LCL and blood samples, respectively, were associated with biological processes related to maintaining skin structure and organization and included cell-cell junction, extracellular matrix interaction, and transforming growth factor beta (TGF-β) signaling. We find that pathways related to cell cycle and DNA repair are enriched in cell lines compared to their tissues of origin, and more enriched in LCLs compared to fibroblasts. Overall, we found that the preserved pathways in cell lines are mainly related to the cell type specific functions (B cells or fibroblasts) rather than tissue-enriched functions. Further, many of the genes in pathways that help define the function of the tissue are expressed at a lower level in cell lines relative to their tissues of origin. Cell line and tissue-specific gene regulatory networks Understanding the structure of gene regulation in cell lines compared to their tissues of origin has the potential to help interpret the differential expression results and to reveal important regulatory differences. PANDA (Passing Attributes between Networks for Data Assimilation) is an approach that integrates multiple types of genomic data to infer the network of interactions between TFs and their target genes [[86]28]. In contrast to other network reconstruction approaches, PANDA searches for consistency across multiple sources of information in order to build a holistic regulatory model. The core of the PANDA algorithm is a message passing approach in which regulatory processes are modeled as a communication process between “transmitters” (TFs) and “receivers” (target genes). For communication to occur, both transmitters and receivers play an active role: TFs are responsible for regulating genes and the target genes must be available to be regulated. PANDA starts with a TF/target gene prior regulatory network consisting of potential routes for communication, which is built by mapping TFs motifs to the genome. PANDA integrates this prior network with protein-protein interaction (PPI) and gene expression data, using it to model TF cooperativity and gene co-expression, respectively. Based on this information, it then iteratively estimates the most likely routes of communication through the regulatory network. We used PANDA to estimate gene regulatory networks in LCL, blood, fibroblast, and skin (Additional file [87]6). For each network, we began with the same TF/target gene prior regulatory network and PPI prior network, but used “tissue”-specific gene expression data. This resulted in four gene regulatory networks where each edge connects a TF to a target gene, and the associated edge weight indicates the strength of the inferred regulatory relationship in that “tissue”. These networks can inform us about the genome-wide regulation of the cell lines and tissues analyzed as we compare 652 TFs, 27,175 target genes, and more than 17 million edges between them. We used bootstrapping to select random sets of RNA-Seq expression data to estimate the robustness of these network models, generating 100 random networks for each of the cell line or tissue groups (Additional file [88]6). We observed a high level of consistency across the bootstrapped networks; the average weight of the edges across these networks is highly similar to the weights of the edges in the network estimated using all samples (Pearson correlation ≥0.98). For each TF we computed the difference between the “out-degree” (sum of edge weights from that TF) in the cell line network and the corresponding tissue of origin network (Fig. [89]2a); these values are also highly robust across our bootstrapped networks (Additional file [90]6). We ranked TFs by their absolute difference in out-degree (differential targeting, Additional file [91]7) and found that TFs with the largest differential targeting were involved in cellular responses to stress and DNA damage and in the control of cellular growth (Fig. [92]2b, Additional file [93]8). For both the LCL-vs-blood and fibroblast-vs-skin comparisons, many of the top differentially-targeting TFs, such as tumor protein p63 (TP63), TOP1 binding arginine/serine rich protein (TOPORS), and Kruppel like factor 15 (KLF15), belong to the p53 family or interact with p53 and are important mediators of DNA damage response regulating cell cycle arrest, DNA repair and apoptosis [[94]29–[95]32]. Fig. 2. Fig. 2 [96]Open in a new tab Transcription factors differentially-targeting genes in cell lines and their tissues of origin. a Illustration of the TF out-degree difference between each cell line and its tissue of origin. Positive values indicate higher targeting in cell lines, and negative values indicate higher targeting in tissues. b Function of the TFs with the largest difference in out-degree comparing LCL-vs-blood; and fibroblast-vs-skin regulatory networks. The complete table with references and