Abstract The advent of high-throughput sequencing has made it possible to measure the expression of genes at relatively low cost. However, direct measurement of regulatory mechanisms, such as transcription factor (TF) activity is still not readily feasible in a high-throughput manner. Consequently, there is a need for computational approaches that can reliably estimate regulator activity from observable gene expression data. In this work, we present a noisy Boolean logic Bayesian model for TF activity inference from differential gene expression data and causal graphs. Our approach provides a flexible framework to incorporate biologically motivated TF–gene regulation logic models. Using simulations and controlled over-expression experiments in cell cultures, we demonstrate that our method can accurately identify TF activity. Moreover, we apply our method to bulk and single cell transcriptomics measurements to investigate transcriptional regulation of fibroblast phenotypic plasticity. Finally, to facilitate usage, we provide user-friendly software packages and a web-interface to query TF activity from user input differential gene expression data: [31]https://umbibio.math.umb.edu/nlbayes/. Introduction Gene regulation plays an essential role in many cellular processes, including metabolism, signal transduction, development and cell fate ([32]1,[33]2). At the transcriptional level, regulation of genes is orchestrated by the concerted action between transcription factors (TFs), histone modifiers, and distal cis-regulatory elements to finely tune and modulate expression of genes ([34]3). Sequence specific TFs, which have affinity for specific DNA sequences, may bind to cis-regulatory elements at the enhancer or promoter region of genes to either activate (upregulate) or repress (downregulate) the expression of genes. Aberration in TF activity and the dysregulation of target genes have been implicated in many pathological states and human disease ([35]4). Activity of TFs can be triggered downstream of signaling events, which in turn may be activated in response to environmental and molecular perturbations ([36]5). Perturbations in TF activity often result in modulation of gene expression. The technological advancements in high-throughput sequencing have made it possible to measure expression of genes at relatively low cost. However, direct measurement of regulatory mechanisms, such as TF protein expression and functional activity in a high-throughput manner is still not readily available. Consequently, there is a need for computational approaches that can identify active regulatory mechanisms from observable gene expression data. The scientific community has proposed several computational algorithms and biophysical models to study the impact of TF activity on gene expression. Some of these algorithms use statistical and probabilistic approaches to infer TF activity and dynamics directly from gene expression data ([37]6–11), and more recently ([38]12–17), while others rely on biophysical approaches to model expression of genes based on known TF–gene interactions ([39]18,[40]19). Boolean networks and probabilistic extensions have also been used to model gene regulation ([41]9,[42]20–23). In logic models, genes are assumed to be either ON or OFF and Boolean logic (AND, OR, NOR, etc.) is utilized to model combinatorial regulation. For example, Bulashevska and Eils ([43]9) introduced a Bayesian approach to generalize the Boolean logic to incorporate noise and utilized their approach to reconstruct gene regulatory networks in yeast. Another class of algorithms use prior biological knowledge on biomolecular interactions to link a differential gene expression (DGE) profile to upstream regulators (e.g. TFs) ([44]24–30). The essential ingredients of these type of algorithms are (i) an input DGE profile, (ii) a network of biomolecular interactions or pre-defined gene sets and (iii) an inference algorithm to query the network. The output is a set of candidate regulators, pathways, or biological processes with associated probabilities or significance P-values. The DGE profile as obtained from RNA-Seq or microarrays studies is the observable input that quantifies the difference in transcript abundance between two conditions (e.g. healthy versus disease, stimulated versus not stimulated, etc.). The network of biomolecular interactions encapsulates the prior biological knowledge. The inference algorithms typically map the DGE profile to the network to identify drivers (nodes, terms, or paths in the graph) of the observed transcriptional changes. SI Table [45]2 lists a few representative methods for inference of TF activity and reconstruction of gene regulatory networks and their corresponding descriptions ([46]12,[47]14,[48]15,[49]25,[50]30–34). Table 2. Gene markers used for classification of single cells in the prostate tissue single cell dataset by Henry et al. ([51]58) Tissue type Gene markers Basal Epithelia KRT14, DST, KRT15, KRT5, RGCC Luminal Epithelia MSMB, KLK3, ACPP, PLA2G2A, KLK2 Other Epithelia 1 SCGB3A1, LCN2, PIGR, WFDC2, FCGBP Other Epithelia 2 KRT13, APOBEC3A, CSTB, LYPD3, SERPINB1 Fibroblast APOD, FBLN1, PTGDS, CFD, DCN Smooth muscle TPM2, ACTA2, RGS5, MT1A, MYH11 [52]Open in a new tab These are the same markers used in that study. Despite the popularity and success of these methods, several challenges remain to be sufficiently addressed. For example, biophysical models are computationally expensive and are suitable for small scale applications or simulation studies. Boolean logic models, although simple to implement, cannot directly account for noise in gene expression data. On the other hand, probabilistic models for inference of gene regulatory networks typically overlook the context of experiment. Regulatory networks may be noisy or contain interactions that are applicable in a specific context only. Properly modeling any dependencies on the biological context in active regulator inference and enrichment analysis algorithms can lead to more accurate inference of the regulatory mechanisms specific to that context. Utilizing causality and information on mode of regulation (activation versus repression) can also significantly reduce false positive predictions, resulting in more interpretable models. Moreover, biologically motivated TF–gene interaction logic models (e.g. combinatorial effects of activators and repressors on gene expression) must be taken into consideration when inferring transcriptional regulatory programs. To address these challenges, we have developed a Noisy-Logic Bayesian (NLBayes) TF activity inference model that accounts for these factors in a unified manner. Given an input DGE profile, our model incorporates the prior information on causal regulatory interactions and makes posterior adjustments to further account for noise and determine the context-specific posterior network structure and active regulators through a Gibbs sampling procedure. We evaluate the performance of our model using simulation studies as well as over-expression datasets and demonstrate that our method can accurately identify active transcriptional regulators from gene expression data and causal graphs. We benchmark our algorithm against VIPER, a closely related method that is widely used for identification of regulon activity ([53]15). Both methods are able to identify relevant TFs in the corresponding experiments, with several TFs identified by both methods at the intersection, indicating that the algorithms complement each other. Our method can be used for novel biological discoveries. To illustrate this, we apply our method to investigate transcriptional regulation of fibroblast phenotypic plasticity in response to signaling molecules TGFβ and CXCL12. This study utilizes differential expression profiles from bulk RNA-Seq experiment using prostate cell lines ([54]35), stimulated with TGFβ and CXCL12. Our analysis recovers several TFs, including YAP1 as the top prediction, which has been identified as a driver of myofibroblast differentiation in multiple tissue phenotypes ([55]36–45). Additionally, we present new single cell gene expression data from the same prostate cell lines as well as three additional human prostate fibroblast cell lines. We characterize the cell lines at the transcriptional level and apply our algorithm to identify major transcriptional regulators in each cell line and study the impact of immortalization on transcriptional regulation. Our algorithm provides a general framework and a widely applicable tool to study transcriptional regulators of differential gene expression. To facilitate wider use, we provide R and Python packages, and a web-interface for running inference experiments. Figure [56]1 summarizes the overall approach. Figure 1. [57]Figure 1. [58]Open in a new tab Schematic diagram of the NLBayes inference pipeline. The starting point is gene expression data from two conditions, from which a differential gene expression profile is calculated by discretizing gene values to –1 (down regulated), 0 (not regulated) or 1 (up regulated) using cutoff thresholds on P-values and/or foldchange. A TF–gene interaction network is used to build the graphical model. Values for gene nodes are populated from the differential expression profile. NLBayes runs a probabilistic query on the causal network and outputs the posterior distribution of TFs, from which the activation state of the TFs is determined. Materials and methods Noisy logic-based gene regulation graphical model As a starting point, we construct a causal graph from the TF–gene interaction network as follows. The causal graph is a triplet Inline graphic , where Inline graphic represent the nodes, Inline graphic represent the edges (pairs of nodes) and Inline graphic represent signs associated with each edge Inline graphic . Figure [59]2 shows a graphical representation of the proposed model. The nodes in the graph consist of the following layers: Figure 2. [60]Figure 2. [61]Open in a new tab The proposed graphical model for each interaction Inline graphic . * Transcript nodes Inline graphic : These are the terminal nodes in the network and represent the transcripts. The domain of these nodes is Inline graphic , representing downregulated, not regulated and upregulated respectively. The value of these nodes will be populated from the input gene expression data. * True states nodes Inline graphic : These nodes represent the true unobserved state of the transcript nodes, and are incorporated to account for noise in input data. These nodes have domain Inline graphic and are central in the implementation of noisy logic gates. * Regulator state nodes Inline graphic : These nodes represent the activation state of TFs in the network. Here, we use Inline graphic , for no activation or activation respectively. * TF activity noise nodes Inline graphic . For each Inline graphic in the network, we assign a node Inline graphic , representing a continuous random variable with domain Inline graphic . These nodes represent the probability of activation for the corresponding node Inline graphic and are modeled by a beta distribution. * Mode of regulation Inline graphic . These nodes represent the mode of action (activation versus repression) between parent TF node Inline graphic and true state node Inline graphic . These nodes have a domain Inline graphic , representing inhibition (I), non-applicable (NA) and activation (A) respectively. We use one-hot encoding for this variable, i.e. Inline graphic is a vector of size 3 with components Inline graphic . Transcriptional logic In our model, we incorporate logic gates as in ([62]9) to explicitly account for combinatorial effects using Boolean logic while accounting for uncertainty. In this work, we consider a combination of noisy OR and NOR gates. We consider two models as follows. OR model This model is used to describe the likelihood of downregulation of a gene by a set of TFs. In this model, the presence of one active inhibitor is sufficient to downregulate the gene. The probability mass function is modeled as a Bernoulli trial with probability of success graphic file with name M00023.gif where Inline graphic represents all model parameters involving Inline graphic nodes. Although this model seems like a sensible choice, it assumes that all the genes targeted by a TF should strictly follow its influence. However, target regulation depends on many other factors, and we should expect only a fraction of targets to be effectively regulated by a given TF. To make our model more flexible, we incorporate a hyper-parameter Inline graphic , that allows the likelihood model to tolerate more zero-genes in the evidence data. The OR model above is now graphic file with name M00027.gif where Inline graphic represents a prior probability of finding a downregulated gene in the evidence data. For convenience, let us define Inline graphic , and let Inline graphic and Inline graphic denote two different values for the Inline graphic parameter that are set depending on whether the current gene has been observed as modulated or not, respectively. The value of the parameter Inline graphic is set close but greater than zero, for non-differentially expressed genes, e.g. Inline graphic , while for differentially expressed genes it may be set close or equal to one. Additionally, an extra Inline graphic term is now multiplying Inline graphic , effectively increasing the sensitivity of these TF–gene interactions. This has proven beneficial to improve specificity of the inference results. We set Inline graphic and Inline graphic is set to be proportional to Inline graphic , i.e. the number of edges in the network that point to genes that are observed as differentially expressed. More specifically, we have used the following relation: graphic file with name M00040.gif where Inline graphic is the total number of edges in the network. OR-NOR model for gene activation This model offers a relatively simple way for describing combinatorial effects of both up- and down-regulation within the same interaction network. The rationale is that for a target gene to be activated, at least one of its upstream activators must be activated (OR gate), while at the same time none of its inhibitors is active (NOR gate). In this case the target gene up-regulation event is modeled as a Bernoulli trial with probability of success graphic file with name M00042.gif where Inline graphic is the prior probability of finding an upregulated gene in the evidence data. Target gene activation state is regarded as a multinomial trial with three possible outcomes: upregulation, downregulation, or not changed. This is congruent with discretized differential expression data and allows building a complete model likelihood. Correspondingly, the complementary outcome likelihood is then represented by a NOR–NOR model. graphic file with name M00044.gif Here, Inline graphic is the prior probability of finding a non-differentially expressed gene in the observed data. The model likelihood The posterior probability of model parameters given the observed data is given by: graphic file with name M00046.gif where Inline graphic is the conditional probability of the observed expression value Inline graphic given the true state. This conditional probability models the true positive and false positive rate in input gene expression data. Table [63]1 shows the values used in our implementation. These values are estimations based on typical experimental errors. This component of the model may encompass several sources of uncertainty, such as dropped reads during RNA sequencing or type I errors in the statistical analysis made for computation of differential gene expression. This may be treated as a prior probability, representing our belief that we would observe these rates. Table 1. Conditional probability of False positive and False negative in the true state given the observed state Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 0.945 0.050 0.005 Inline graphic 0.050 0.900 0.050 Inline graphic 0.005 0.050 0.945 [64]Open in a new tab Fitting the model The next task is to find the set of model parameters that maximize the model likelihood. Given the large scale of the parameter space, this problem is intractable analytically. A widely used approach for inferring posterior probability in large scale Bayesian networks is Markov Chain Monte Carlo (MCMC) sampling. In particular, Gibbs sampling is a suitable MCMC method to approximate the posterior distribution of the model parameters given the observed data. In Gibbs sampling, we sequentially sample from each random variable, conditioned on the current state of its Markov blanket. To assess the convergence to the posterior distribution of the model parameters, we run at least three independent sampling chains and periodically compute the Gelman–Rubin statistic for each random variable ([65]46). We stop the process after this diagnostic statistic is less than 1.1 for every random variable in the model. The core of the sampling algorithm has been implemented using C++, and user-friendly R and Python packages developed. Algorithm In this section, we present Algorithm 1, which was used throughout this work, along with a detailed description of each of the steps. Algorithm 1. Sample model's posterior distribution. graphic file with name lqad106figu1.jpg [66]Open in a new tab * Step 1: Data preprocessing. The input differential gene expression data should contain the computed P-values and fold-change scores for all genes. In this step, we select thresholds for P-values and fold change. The aim is to limit the input data to include only the most significant DEGs, totaling to ⇐ 800 as a rule of thumb. * Step 2: Populate the causal graph. Here we assign the observed values (evidence) for the differential expression of each gene to the corresponding Inline graphic nodes in the graph. These values will remain fixed during the sampling process. * Step 3: We create Inline graphic independent copies of the graph, allowing us to store separate states of Inline graphic Markov chains to sample. The use of multiple Markov chains provides a way to compare independent results and enables early stopping. Ideally, all Inline graphic chains should converge to the same posterior distribution for all random variables in the model. * Step 4: Gibbs sampling. For each random variable we retrieve the current state of its Markov blanket, which is given by the current values of the children nodes, the parent nodes, and the children of the parent nodes. Then, given this Markov blanket, we compute the conditioned probability of the random variable, from which a new sample is drawn. The sampled value is stored and we move the next random variable and repeat the process. After completing the sampling process for all variables, we start over from the first random variable. This process is repeated for each of the Inline graphic graph replicas created in Step 3. To assess convergence of the posterior distribution, we pause the sampling process after completing Inline graphic rounds to compute the Gelman-Rubin statistics described in the next Step of the algorithm. The choice of k, the period in which convergence is checked, is arbitrary and has been set to Inline graphic in our implementation. * Step 5: Maximum Gelman–Rubin statistic computation. During the Markov chains sampling process, we need to determine whether the sampled distributions for the random variables in the model have converged to their respective posterior distributions. For this purpose, we compare the Inline graphic Markov chain distributions for every single variable in the model by means of the Gelman–Rubin statistic Inline graphic , which is a dimensionless score that combines the between chain and within chain variances of the random variables. The statistic approaches Inline graphic as the sampled distributions converge to a steady state. This score is computed for all random variables and we keep the maximum value obtained. * Step 6: Convergence assessment. Convergence of the sampling process is called when Inline graphic for every random variable. If this condition has not been satisfied, we continue the sampling process in Step 4. * Step 7: At this point the sampling process is complete, and the mean value for each random variable is returned. Simulations To assess the impact of noise on model performance, we simulated interaction networks and input differential gene expression data (see section: Results, Simulation studies). First, a random interaction network was generated by selecting a total number of Inline graphic TFs and a total number of Inline graphic genes in the network. For each TF, its number of target genes was sampled according to a negative binomial distribution. Target genes were assigned randomly using a uniform distribution among all genes in the network. Finally for each interaction edge a sign of regulation was assigned with 0.65 probability of being activation (+) and 0.35 probability for repression (–). Subsequently differential gene expression data was simulated as follows. First, we select a random set of TFs that are assigned as active. This set of TFs constitutes the ground truth. For each active TF, we select 10% of its target genes and assign differential expression according to the sign of regulation as predicted by the graph: +1 if target is upregulated by TF, –1 if downregulated. For genes targeted by multiple active TFs, we perform the algebraic sum of all incoming interactions and take the net sign. The parameters for network simulation were chosen to mimic some characteristics of the biologically derived ‘three-tissue network’ described below. These parameters include the proportion of TFs to Genes, the node degree distribution for TFs, and the average number of TFs that target the same gene. To assess the robustness of our method against the network topology, we generated several additional networks with different combinations of these parameters (See [67]Supplementary Information File, [68]Supplementary Table S4). TF–gene interaction networks The TF–gene interaction network was obtained from ([69]30) in which interaction network were constructed from direct experimental evidence, integrating data ([70]47) from ChIP-Atlas and The Genotype-Tissue Expression (GTEx) databases ([71]48,[72]49). Integration was achieved through a regularized Gaussian Graphical model that softly integrated TF–gene interactions derived from ChIP-Seq data into gene expression derived from tissues, resulting in 15 tissue specific TF–gene interaction networks as well as a ‘merged network’ obtained by overlapping tissue-specific networks. In the present work we use the merged network containing interactions that are consistent in at least three tissue types, resulting in 338680 TF–gene interactions from 750 TF molecules (see section: Results, Over-expression datasets). For the benchmark experiments (see section: Results, Benchmarks), we used the Human Breast Carcinoma context specific network termed ‘regulonbrca’, that has been constructed with ARACNe and made available by Lachmann et.al. ([73]33). The regulon object has been filtered to retain only transcription factor molecules, similar to the approach followed by the authors of VIPER ([74]15): DNA-binding transcription factors (GO:0003700), and DNA-binding molecules (GO:0003677) annotated as ‘transcription regulator activity’ (GO:0140110) or ‘regulation of DNA-templated transcription’ (GO:0006355). This results in a network with 1693 TF molecules and 200 336 TF–gene interactions. For use in the NLBayes inference, we further filtered out interaction edges with likelihood <0.5, resulting in 73 639 TF–gene interactions. Differential gene expression data DNA microarray-based gene expression profiles for over-expression studies in human primary mammary epithelial cell cultures ([75]50), were obtained from the GEO repository ([76]GSE3151) by using the GEO2R tool for sample selection ([77]51). Differential expression was computed using the limma R package ([78]52). We limited the number of differentially expressed genes by applying cutoff thresholds for the adjusted P-values ( Inline graphic ) and log2-foldchange ( Inline graphic ) for E2F3 and MYC datasets. For the RAS dataset these cutoff values produced 2226 differentially expressed genes. To further limit the number of DEGs in this experiment, we increased the log[2]-foldchange cutoff threshold to 2. Additionally, we used data from a study on fibroblast-to-myofibroblast phenotypic conversion in response to pro-fibrotic signaling molecules TGFβ and CXCL12 ([79]35,[80]53,[81]54) (see section: Results, Fibroblast phenotypic plasticity). Differential expression for TGFβ and CXCL12 treated fibroblasts were generated using the R package edgeR ([82]55). Single cell RNA sequencing Following Trypsin digestion, cells were collected in a 50mL conical tube and washed and resuspended in PBS to a final concentration of ∼700 cells/ul. After gentle resuspension, 2.3 ul (∼1610 cells) per sample was combined with nuclease free water and master mix per 10× recommendations for a targeted recovery of ∼1000 cells per sample. The samples were loaded onto a 10× Chromium Chip A (PN230027, deprecated) (N1 and SFT1) or Chip G (PN 2000177) (pHPF and iHPF) and run through the 10X Chromium controller instrument and manufacturer protocols for RNA recovery and library preparation. Briefly, cells were partitioned into individual lipid droplets, lysed to release RNA and tagged with UMIs (unique molecular identifiers) for cell of origin. mRNA was isolated with dT oligo beads then reverse transcribed to DNA, ligated with Illumina-compatible sequencing adapters with multiplex capable barcodes (PN-120262 for N1/SFT1 (deprecated), PN-1000213 for pHPF/iHPF), for sample of origin, and PCR amplified for 13 cycles. N1 and SFT1 samples were prepared with Chromium v2 Library and gel bead kit (PN-120267, deprecated) while pHPF and iHPF were prepared with v3 (PN 1000128). Following library preparation, samples were assessed by Agilent 2100 Bioanalyzer using High Sensitivity chips and reagents (PN 5067–4626) to confirm a normal size distribution to minimize bias, quantified by qPCR with Illumina adapter compatible primers and Sybr Green (Kapa ROX Low Universal Library Quant kit PN KK4873) and molarity calculated by size-correcting to the bioanalyzer average size. N1 and SFT1 samples were pooled together while pHPF and iHPF were pooled together separately. The samples were sequenced on a Hiseq 2500 in Rapid Run mode in the CPCT Genomics Core, using paired-end on board clustering (PN PE-402-4002) and sequencing by synthesis (SBS) reagents (PN FC-402-4021). Twenty-eight bases were sequenced in read 1 to capture the UMIs, 8 bases for the single indexes, and 91 bases in Read 2 to capture transcripts, yielding ∼100 million total sequencing reads per sample. RNA-Seq alignment Reads were aligned to human genome version GRCh38 using the 10× cellranger v4.0.0 pipeline (cellranger mkref, and cellranger count) using default parameters. Over 93% of reads were mapped to the genome for all 4 datasets. Mapping rates were 93.6%, 93.3%, 96.3% and 96.3% for N1, SFT1, iHPF and pHPF, respectively. To improve the quality of the data, the resulting count matrices were reanalyzed to force the number of cells accepted, to those with highest UMI counts. Single cell data analysis Count data from single cell alignment were processed using the Seurat R package ([83]56,[84]57). Seurat objects were created from the filtered matrices resulting from the alignment step. Low quality cells were filtered out, by removing those with large mitochondria contamination and cells with either too few or too many unique genes or total RNA count. Cell cycle scores were assigned to each cell with the method CellCycleScoring using default parameters and cell cycle genes provided by Seurat package. Datasets N1, SFT1, pHPF and iHPF were combined by using Seurat's merge function and normalized with SCTransform using 3000 variable features and no centering. Batch effect was removed by considering the number of genes, RNA counts and mitochondrial RNA contamination as unwanted sources of variation. Dimensionality reduction was performed through principal component analysis (PCA) and Uniform Manifold Approximation & Projection (UMAP) as implemented in Seurat package, with functions RunPCA (30 PCs) and RunUMAP respectively. Cells were then filtered to work with G1 cells only. Differential expression was computed with respect to pHPF, by using the FindMarkers method. All Seurat methods were used with default parameters, unless otherwise stated. Prostate tissue single cell data was retrieved from ([85]58). This corresponds to FACS sorted cells, containing fibroblasts, smooth muscle, endothelial and epithelial cells. Here we ignore FACS sorting labels as we run our own classification process. This dataset was preprocessed using Seurat and its SCTransform pipeline, same as with cell lines data. Cells were clustered by using methods FindNeighbors and FindClusters. Each cluster was classified by looking into cell type markers taken from ([86]58), to assign labels: fibroblast, smooth muscle, endothelia, basal epithelia, luminal epithelia, and other epithelia 1 and 2. Cells in G1 cell cycle phase were retained for downstream analysis. Table [87]2 shows the markers used for the tissue cells classification. Cell lines data and tissue data were combined by using Seurat's integration pipeline with SCTransform, using 3000 features, k.anchor = 6, and reduction = ‘rpca’. Results To test our approach, we performed a series of experiments including simulations studies, benchmarks against an alternative approach, and inference of TF activity on novel datasets. The core of the algorithm uses an OR-NOR transcriptional regulation logic to predict TF activity. In this logic two conditions must be satisfied for gene activation: 1) at least one activator is targeting the gene and 2) no inhibitor is targeting the gene. On the other hand, for down regulation of genes we use a simple OR transcriptional regulation logic model: at least one inhibitor must target the genes. The algorithm outputs posterior probabilities for each TF activation state. The prior probability for TF activation is set to a small value ( Inline graphic ), and as such we consider a TF with posterior probability Inline graphic as potentially relevant. We then define three thresholds to classify inferred active TFs: High-confidence ( Inline graphic ) Mid-confidence ( Inline graphic ), and Low-confidence ( Inline graphic ). Simulation studies We performed several simulation studies to assess the ability of our algorithm in recovering active transcriptional regulators from gene expression data and to test the robustness of the inference process to noise in gene expression data and the causal graph of TF–gene interactions. For this analysis we generated a random interaction network consisting of 250 TFs and 5000 downstream target genes. The downstream targets of each TF were picked at random from a binomial distribution, resulting in ∼30000 interactions (edges) in the network. The edges of the network were randomly set as activation (65%) and inhibition (35%). We randomly selected 10 TFs and assigned them as active (the ground truth) and simulated downstream differential gene expression data by assigning + 1 or -1 to 10% of the target genes according to the causal graph. For genes targeted by multiple active TFs, we calculated the algebraic sum of all incoming interactions and took the net sign. This produced an average of 120 differentially expressed genes. Each experiment was repeated 20 times. Impact of data randomization For this analysis, we randomized a fraction of the input data (0, 0.25, 0.5, 0.75, 1.00) by randomly toggling the values. At 0% the data is not randomized, while at 100% the input data is completely random. The inference procedure was run on each input data set. Figure [88]3A and [89]B shows the ROC and precision-recall curves for each randomization experiment respectively. Figure [90]3E illustrates the sampled posterior distributions for active TFs at different randomization levels. As expected, at 100% the model fails to recover active TFs. However, up to 75% randomization the algorithm is still able to recover active TFs, with significant increase in accuracy at lower levels of noise. Figure 3. [91]Figure 3. [92]Open in a new tab Performance evaluation of the model before and after randomization of data. ROC curves and Precision vs Recall curves are shown for randomization simulation in input gene expression (A, B) and TF–gene interaction network (C, D). p indicates the corresponding fraction of randomization used. AUC scores are displayed in the legend. (E) The posterior distributions for TF activity is shown. The colors indicate the ground truth, i.e. whether the TF was set to active in the simulation. (F) The impact of the number of target genes on the inference results for the case of 25% randomization of data. To assess the impact of the number of target genes regulated by a TF on the posterior probability of the TF’s activity, we ran multiple simulations and plotted the posterior probability vs the number of genes regulated by TFs, color coded by TF activity (Figure [93]3F). For these simulations, we used noisy data at 25% randomization. We observe that for active TFs with less than 30 target genes, the posterior probability is low. This is expected as in these simulations, only 10% of target genes are set as differentially expressed, yielding an average of only three target genes that are modulated. This information is too small to shift the posterior probability. However, as the total number of target genes increase, we observe a threshold effect, where posterior probabilities stabilize, and the total number of targets does not have an impact on the inference. These experiments were repeated for several other randomly generated networks with consistent results (SI File [94]Supplementary Table S4). Impact of network randomization For this analysis, we select a fraction (0, 0.25, 0.5, 0.75, 1.00) of the edges in the network and randomly reassign them to different target genes. The inference was run using unperturbed input gene expression data and randomized networks. Figure [95]3C and [96]D shows the ROC and precision-recall curves for each randomization experiment. A similar picture as in data randomization emerges. These results demonstrate the robustness of our algorithm at 25% randomization level in both the gene expression data as well as the input network, while still retaining some prediction power at the 50% and 75% randomization levels. Over-expression datasets To test the ability of our algorithm in recovering known active regulators, we used three publicly available over-expression datasets ([97]GSE3151), performed on human primary mammary epithelial cell cultures, each generated by over-expression of an oncogene: E2F3, c-Myc and H-Ras ([98]50). For these inference experiments, we utilized a TF–gene interaction network generated by Farahmand et al. ([99]30). The network was generated using several high-throughput datasets, and a Gaussian Graphical Model. We chose to use this network as it showed consistently high predictive power across several datasets ([100]30). In each experiment, differentially expressed genes were determined compared to the control sample and inputted into the TF activity inference algorithm. Tables [101]3 and [102]4 summarize the inference results. Inferred active regulators are split into three categories based on posterior probability Inline graphic of inferred activity: High-confidence ( Inline graphic ) Mid-confidence ( Inline graphic ), and Low-confidence ( Inline graphic ). Percentage of differentially expressed genes targeted (explained) by at least one inferred active regulator is also presented in the table. Table 3. Predicted active TFs from 3 over-expression experiments: E2F, MYC and RAS E2F3 c-MYC H-RAS Name Enrichment Inference Name Enrichment Inference Name Enrichment Inference E2F1 3.7E-06 1.00 TCF3 4.9E-01 1.00 E2F3 6.1E-01 1.00 FLI1 2.7E-02 1.00 HDAC2 1.0E + 00 0.90 PPARG 4.0E-01 1.00 FOXA2 7.0E-01 1.00 TP53 1.2E-02 0.85 TEAD4 4.9E-01 1.00 HOXA4 7.0E-01 1.00 − − − KLF5 2.2E-02 1.00 MXI1 2.6E-01 1.00 − − − FOSL1 2.4E-06 1.00 BCOR 2.2E-02 1.00 − − − RELA 1.4E-01 1.00 AR 1.4E-02 1.00 − − − ASCL1 1.0E + 00 0.99 FOXF1 7.9E-01 1.00 − − − STAT3 6.1E-01 0.99 TCF3 8.6E-01 1.00 − − − PRKDC 1.0E + 00 0.97 MBD3 1.0E + 00 0.99 − − − JUND 1.4E-01 0.97 SOX2 7.3E-01 0.97 − − − NFKB2 1.2E-02 0.96 − − − − − − SOX2 5.3E-01 0.96 − − − − − − CTCFL 1.0E + 00 0.94 − − − − − − GATA3 9.0E-01 0.92 − − − − − − KDM6B 3.2E-03 0.86 HNF1B 1.0E + 00 0.77 SIX2 1.0E + 00 0.77 MYC 2.1E-01 0.75 FOXM1 1.7E-02 0.76 ETV1 1.0E + 00 0.73 NME2 1.0E + 00 0.66 NR5A2 9.2E-01 0.72 KLF11 1.0E + 00 0.54 ESRRA 1.0E + 00 0.65 HNF4A 1.0E + 00 0.71 ILF3 2.9E-01 0.51 SREBF1 1.8E-01 0.64 POU5F1 7.0E-01 0.69 − − − MBD2 9.7E-01 0.64 GATA3 1.5E-01 0.68 − − − − − − TCF21 1.0E + 00 0.50 − − − − − − PGR 8.6E-01 0.39 PAX5 1.0E + 00 0.49 PBX4 1.0E + 00 0.41 TEAD1 1.0E + 00 0.39 TFAP4 2.0E-01 0.49 HOXB13 1.0E + 00 0.37 GATA6 1.0E + 00 0.39 ZC3H8 5.0E-01 0.45 FOXF1 3.6E-03 0.36 VEZF1 1.0E + 00 0.30 PPARG 7.6E-01 0.44 KLF6 1.8E-02 0.34 EGR1 7.0E-01 0.29 PDX1 6.9E-01 0.42 BCOR 9.7E-01 0.33 NKX2-1 1.0E + 00 0.29 TEAD4 2.9E-01 0.37 HNF1B 1.0E + 00 0.32 HOXA9 4.3E-01 0.26 BATF 7.3E-01 0.37 SRF 9.0E-01 0.31 HOXC5 1.0E + 00 0.26 THAP11 1.0E + 00 0.36 GATA4 1.0E + 00 0.27 − − − SMARCC2 1.0E + 00 0.33 CEBPB 4.1E-02 0.24 − − − NCOA1 1.0E + 00 0.33 MAFF 3.6E-03 0.22 − − − TRIM28 2.9E-01 0.24 − − − − − − MYB 1.0E + 00 0.21 − − − [103]Open in a new tab The panel lists the inferred regulators, split into three categories, high-confidence (top), mid-confidence (middle) and low-confidence (bottom). For each TF, enrichment scores are computed from the number of target genes that are differentially expressed. Table 4. Summary of over-expression experiments E2F3 c-MYC H-RAS max P-value = 0.01 max P-value = 0.01 max P-value = 0.01 min abs(log2 fold-change) = 1.00 min abs(log[2] fold-change) = 1.00 min abs(log[2] fold-change) = 2.00 DEG Increased Decreased DEG Increased Decreased DEG Increased Decreased 418 321 97 127 75 52 409 163 246 Confidence level DEG explained Confidence level DEG explained Confidence level DEG explained 0.80 53% 0.80 32% 0.80 58% 0.50 61% 0.50 49% 0.50 64% 0.20 65% 0.20 70% 0.20 71% [104]Open in a new tab Top: Thresholds for P-values and log-foldchange for each experiment are shown, and their respective total number of Differentially Expressed Genes (DEGs) and the number of genes with increased and decreased RNA expression. For the RAS over-expression experiment, a higher log-foldchange threshold was used to limit the number of DEGs for inference. Bottom: proportion of explained DEGs by at least one inferred regulator at the indicated confidence level. For the E2F3 expression data, the E2F1 is returned as the top regulator. E2F1 and E2F3 have a similar function in control of the cell cycle and are similarly implicated in cancer ([105]59). E2F3 also regulates expression of FLI1, an ETS domain transcription factor and proto-oncogene ([106]60), as well as the FOXA2 transcription factor that promotes aggressive prostate cancer ([107]61) and HOXA4, a transcription factor important for embryonic development but often over-expressed in human colorectal ([108]62) and ovarian ([109]63) cancers. The MYC and RAS oncogenic proteins transcriptionally activate multiple genes associated with tumor progression and prognosis. MYC transcriptionally activates TCF3, HDAC2 and TP53. The TCF3 transcription factor was recently found to promote gastric ([110]64) and endometrial ([111]65) cancers, among others. HDAC2 has been reported to promote metastasis in pancreatic ([112]66) and breast ([113]67) cancers. MYC also regulates transcription of the well-known TP53 tumor suppressor gene. RAS proteins activate transcription of E2F3 itself, as well as PPARG and TEAD4. Modulation of PPARG activity has been intensively examined as an anti-cancer therapeutic target ([114]68), and TEAD4 which is known to modulate different cellular processes in cancer via its transcriptional output ([115]69). Taken together, these results demonstrate that the algorithm can accurately detect modulated transcriptional signals from DNA binding proteins. Benchmarks We compared the performance of our algorithm against VIPER ([116]15), a widely used method for inference of regulon activity from input gene expression data and causal graphs. For this benchmark, we used the Human Breast Carcinoma context specific network from ARACNe interactome, named ‘regulonbrca’ ([117]33). This network is appropriate for the overexpression datasets as they used human breast cell cultures and allows a fair comparison. Figure [118]3 summarizes the overlap between the two algorithms. For all three experiments, a fisher exact test reveals positive correlation between NLBayes and VIPER with P-values 0.034, 0.001 and 0.09 for the E2F3, MYC and H-RAS overexpression experiments respectively. Overall, there is good agreement between the methods as well as regulators recovered only by one algorithm, demonstrating the viability of both methods in recovering modulated TFs. Each algorithm predicts TF activity that is not shared by the other algorithm. While individually predicted TFs by each method may be important and can provide useful clues to the underlying biology, there is higher confidence in predictions at the intersection of both methods. As an additional benchmark, we have compared the inference results by NLBayes and VIPER on the MYC overexpression experiment, with two other methods for TF activity inference: Gene Set Enrichment Analysis (GSEA) ([119]26,[120]70) and Univariate Linear Model (ULM) ([121]70). To run these two methods, we used the DecoupleR Python package ([122]70), to which we provided the log-fold changes of the corresponding overexpression experiment. Genes with differential expression P-values over 0.05 were set as not differentially expressed (fold-change set to 0). Resulting P-values for each inference method were FDR corrected and TFs with adjusted P-values below 0.05 were labeled as active. Figure [123]5 highlights the overlaps between the 4 different methods. Subsets with TFs inferred active by NLBayes are shown in blue. Here we observe that of the 23 TFs inferred as active by NLBayes, 12 are supported by at least one other method. This analysis illustrates how NLBayes tends to provide a shorter list of active TFs when compared to other methods. This is because the proposed method seeks to find only the best combination of TFs that explain the observed differential expression pattern. Figure 5. [124]Figure 5. [125]Open in a new tab Overlap of inference results by four different methods. Subsets containing TFs inferred active by NLBayes are highlighted in blue. Fibroblast phenotypic plasticity To demonstrate the utility of our methodology in discovery of novel biology, we applied our algorithm to study fibroblast-to-myofibroblast phenotypic conversion in response to pro-fibrotic signaling molecules TGFβ and CXCL12 ([126]35,[127]53,[128]54). In this experiment, patient derived, immortalized prostate N1 cells were treated with the pro-fibrotic proteins TGFβ and CXCL12, both of which are known to promote collagen expression ([129]53). TGFβ acts upon TGF/TGFR signaling axis and activates multiple Smad proteins, while CXCL12 acts upon CXCL12/CXCR4-axis, which transactivates EGFR and downstream signaling through MEK/ERK and PI3K/Akt pathways. Both signaling axes converge in the nucleus and promote the expression of multiple collagen genes ([130]35,[131]54). RNA-Seq data was processed and compared to the background model to generate differential gene expression profiles as previously described ([132]53). Differential gene expression data from TGFβ and CXCL12 treated cells were identified at fold change Inline graphic and four cutoff thresholds for P-value. The different cutoffs for P-value were applied to examine the impact of stringency in significance on the inference results, and were chosen such that only 200, 400, 600 or 800 top differentially expressed genes are considered. To achieve this, we sorted the table of differentially expressed genes by P-values in ascending order and took the top rows for the analysis. Both datasets were used as input to the TF activity inference algorithm. For these experiments (and the remaining experiments), we used the 3-tissue TF–gene interaction network generated by Farahmand et. al. This network contains interaction edges that are common in at least three of the tissues used in that work and showed consistent performance across multiple datasets ([133]30). Table [134]5 summarizes the results. Inferred active regulators upon CXCL12 induction are largely similar to that of TGFβ. This is expected as transcriptional profiles induced by TGFβ and CXCL12 are 75% similar ([135]53). The top predicted regulators for TGFβ and CXCL12 for the top 200 DEGs are YAP1, RBPJ, KMT2C, ELF1, STAT1 and BPTF. YAP1 is known to play a role in the development and progression of multiple cancers as a transcriptional regulator of this signaling pathway and may function as a potential target for cancer treatment ([136]71). Moreover, YAP1 has been identified as a driver of myofibroblast differentiation in several tissue phenotypes, like skin, heart, lung, pharynx, liver and kidney ([137]36–45). STAT1 is a member of the STAT protein family. In response to cytokines and growth factors, STAT family members are phosphorylated by the receptor associated kinases, and then form homo- or heterodimers that translocate to the cell nucleus where they act as transcription activators. The protein encoded by this gene can be activated by various ligands including interferon-alpha, interferon-gamma, EGF, PDGF and IL6. This protein mediates the expression of a variety of genes, which is thought to be important for cell viability in response to different cell stimuli ([138]72). RB1 is a negative regulator of the cell cycle. The active, hypophosphorylated form of the protein binds transcription factors from the E2F family, which forms a transcription factor heterodimer that controls the transcription of cell cycle regulatory genes. RB-E2F and MuvB complexes (which contain LIN9) regulate the expression of G1/S and G2/M genes. G1/S genes are repressed during G0 and early G1 by RB-E2F and DREAM complexes that interact through the DNA binding domains (DBD) of E2F/DP with E2F promoter sites ([139]73). Table 5. Predicted active TFs upon TGFβ (left) and CXCL12 induction (right) TGFβ CXCL12 Top DE genes Top DE genes 200 400 600 800 200 400 600 800 Max P-value 4.7E-133 2.6E-95 1.6E-71 8.0E-53 Max P-value 6.0E-132 4.6E-94 1.2E-72 4.1E-58 YAP1 1.00 1.00 1.00 1.00 YAP1 1.00 1.00 1.00 1.00 RBPJ 1.00 1.00 0.53 1.00 BCLAF1 1.00 1.00 1.00 1.00 KMT2C 1.00 1.00 0.96 0.61 BPTF 1.00 1.00 0.82 0.04 ELF1 1.00 1.00 0.07 0.75 RBPJ 0.99 1.00 1.00 1.00 STAT1 0.96 0.96 1.00 0.80 KMT2C 0.98 0.91 0.07 0.62 BPTF 0.92 0.02 1.00 0.95 GABPA 0.98 0.28 1.00 1.00 HIF1A 0.90 1.00 0.05 0.12 STAT1 0.96 0.99 1.00 0.82 BRCA1 0.63 RB1 0.91 0.95 0.99 1.00 YY2 0.50 LIN9 0.89 0.57 0.83 0.99 JMJD1C 0.08 0.90 PRDM1 0.33 0.10 0.02 VEZF1 0.06 0.89 0.09 0.22 CBFB 1.00 0.78 0.17 ERG 0.84 AHR 0.01 0.68 TCF12 0.82 0.86 0.91 SP4 0.45 ZNF12 0.57 0.90 0.25 ATF2 0.37 0.02 1.00 LIN9 0.15 0.33 0.97 0.90 MYBL2 0.02 0.30 0.01 BACH1 0.98 0.85 BACH1 0.27 0.24 0.45 RB1 0.02 0.97 1.00 ELF1 0.05 0.13 0.94 0.38 GABPA 0.09 0.96 1.00 CHD1 0.40 0.10 BCLAF1 0.15 0.96 0.97 ELK4 0.01 0.10 0.74 TCF4 0.93 TCF4 0.63 MEF2A 0.01 0.01 0.48 0.38 PIAS1 0.04 0.59 SETX 0.02 0.03 0.94 SMAD4 0.01 0.17 0.47 ATF2 0.16 0.34 SETX 0.02 0.29 PIAS1 0.05 0.25 IRF1 0.04 0.04 0.20 [140]Open in a new tab The top row shows total number of DEGs using four different P-value cutoff thresholds (Max P-value). Bottom panel lists the inferred regulators. Posterior probabilities greater than or equal to 0.20 are highlighted in bold font face, and those equal to zero are left blank. We note that all these predicted active TFs appear as top regulators for the top 400, 600 and 800 DEGs, indicating the robustness of the algorithm to the number of DEGs used as input. Fibroblast heterogeneity: single cell experiments We performed several scRNA-seq experiments to further investigate the phenotypic plasticity of human prostate fibroblasts and characterize heterogeneity in cell populations. For this study, we utilized 4 human prostate cell lines: N1, SFT1, pHPF and iHPF (For more information, see SI File [141]Supplementary Table S1). N1 cells are HPV E6/E7-immortalized prostate stromal fibroblasts originally explanted and grown from a stromal Nodule of benign prostatic hyperplasia ([142]74). They exhibit a fibroblastic morphology, and express fibroblastic markers vimentin and calponin. These cells demonstrate secretion and proliferation profiles consistent with aging primary prostate fibroblasts. SFT1 cells are spontaneously immortalized prostate fibroblasts grown from a prostate of a patient with a Solitary Fibrous Tumor of the prostate ([143]75). These cells carry an uncommon NAB2/STAT6 fusion gene that is associated with solitary fibrous tumors and likely accounted for cellular immortalization. pHPF cells are primary Human Prostate Fibroblasts, purchased at passage 3 from Lifeline Cell Technology, harvested from young adult male. Finally, iHPF cells are created through transduction from pHPF cells with an EF1α-driven hTERT Lentivirus construct and have grown continuously in culture > 30 passages. We applied scRNA-seq to all 4 cell lines. Figure [144]6A shows the UMAP projection of the cell lines. The N1 and SFT1 form distinct clusters in close proximity. Most of the cells are in G1 phase. pHPF cells also form a single cluster, mostly consisting of cells in G1 phase. Interestingly the iHPF cells cluster in two groups (A and B) that surround the primary pHPF cells. The majority of cells in iHPF_A cluster are in G1 phase, while iHPF_B consists of a mix of cells in G1, G2, S and M phases. To further investigate the identity of these cells, we merged the data with FACS sorted single cell expression data derived from prostate tissue generated by Henry et al. ([145]58). Figure [146]6B shows that the RNA expression profiles of the five human prostate fibroblast cell lines N1, SFT1, iHPF_A and iHPF_B, and primary pHPF cells, cluster as expected with that of tissue-derived human prostate fibroblasts. As seen in Figure [147]6C, the five human prostate fibroblast cell lines share a large signature of highly and commonly expressed genes, likely reflecting their common fibroblastic cell type. In particular, all five cell lines express COL1A1 (collagen 1) and VIM (vimentin). Examination of the top 10 differentially expressed genes in the four immortalized cell lines compared to primary human fibroblasts shows that N1 and SFT1 demonstrate a high degree of overlap and commonly express several inflammation-associated genes (CXCL1, ZNFAS1, CHI3L1). The iHPF_A and iHPF_B share a common gene signature as well that includes gene encoding signaling proteins (BEX1, WNT5A), growth factors and pathways (EREG, IGFBP5), and a gene over-expressed in the autoimmune disease, rheumatoid arthritis (TGM2). However, iHPF_B cells also highly express genes that are not expressed by iHPF_A, including several associated with vasculogenesis or angiogenesis (ANGPT1, F3, ADAMT1) or connective tissue and bone growth (TNFRSF11B). This suggests that iHPF_B cells may phenotypically resemble endothelial cells, which can differentiate from fibroblasts ([148]76). Figure [149]6E quantifies the average log FC of top expressed markers (Figure [150]6C) compared to the background (pHPF). Taken together, these data suggest that a seemingly homogenous culture of primary stromal prostate fibroblasts may comprise several subpopulations as have recently been shown for dermal fibroblasts ([151]77). Figure 6. [152]Figure 6. [153]Open in a new tab Single cell gene expression data from four prostate cell lines. (A) UMAP projection of cell lines. iHPF shows two separate phenotypes here termed iHPF_A and iHPF_B. Cell cycle phase G1 is prominent in N1 and SFT1, but pHPF cells show significant number of cells in S and G2/M phases. (B) Integration single prostate tissue data ([154]58), showing clusters for Fibroblasts (Fib), Smooth Muscle (SM), Basal Epithelia (BE), Luminal Epithelia (LE) and Other Epithelia (OE1, OE2). All 4 cell lines, N1, SFT1, pHPF and iHPF, appear as interconnected clusters, lying between epithelial and fibroblast cells. (C) Top expressed genes. Overall, all 5 cell classes share the same highly expressed genes including COL1A1. (D) Top 10 differentially expressed genes with respect to pHPF as a reference. Smallest and largest P-values are 1E-128 and 1E-26 respectively. See SI File Tables S2 and S3 for a list of top genes for each cell line, as shown in (C) and (D); full list of differentially expressed genes is available in SI Table [155]1. (E) Differential expression with respect to pHPF, for the highly expressed genes shown in (C). FC of genes in panel C compared to pHPF. Notably, even though collagen expression high in across all groups, it is downregulated in N1 when compared to pHPF. Next, we sought to quantitatively characterize similarities between cell lines at the transcriptional level. We first performed a differential gene expression analysis using the pHPF cell line as the background. Figure [156]7A shows a bar plot of total number of upregulated and downregulated genes in each cell line compared to pHPF cells. We performed a gene set enrichment analysis on up & down regulated genes (Figure [157]7B). As expected, the N1 and SFT1 cells demonstrate a high level of similarity, as we have previously shown that they respond similarly to stimulation with pro-fibrotics ([158]54). Conversely, although immortalized from the pHPF cells, iHPF_A and iHPF_B demonstrate a higher-than-expected dissimilarity, potentially reflecting fibroblast heterogeneity in the primary cell culture from which they were derived. Figure 7. [159]Figure 7. [160]Open in a new tab (A) Total number of DEGs compared to the background model (pHPF). (B) GO term Enrichment analysis of up regulated genes in each cell line (columns). See SI File Fig S1 for the GO term Enrichment analysis on down regulated genes. Next, we applied our algorithm to the DEG profiles from each cell line to quantify similarity in transcriptional gene regulation. Figure [161]8 shows top inferred regulators in each cell line (left panel bar plots), along with their RNA expression level (middle panel bar plots), and the corresponding enrichment of the differentially expressed target genes (right panel bar plots). The enrichment analysis was performed by quantifying the overlap between the targets of the TF and DEGs using Fisher's exact test. This analysis was performed for comparison of enrichment-based methods with our approach. Enrichment-based approaches do not consider the global topology of the TF–gene interaction network into consideration and yield results that are purely based on the local overlap of TF targets and the set of DEGs. In Figure [162]8, we observe that many TFs inferred by our method have low enrichment scores. Moreover, some of these show no RNA expression, which suggests that these are false positives. Figure 8. [163]Figure 8. [164]Open in a new tab Active TF inference for each cluster of immortalized cells. Gene markers were identified by comparison with pHPF and fed into the inference algorithm. In each data set, the left panel bar plots show the inferred probability of regulator activity by the algorithm. Vertical lines mark 0.2, 0.5 and 0.8 probability thresholds. The right panel bar plots show the Enrichment Analysis of DE targets (Fisher's Exact test) for comparison. Significant P-values (<0.05) are highlighted in pink. The middle panel bar plots show mean RNA expression across single cells. TFs with expression above 25-percentile are highlighted in red. TFs with inference posterior probability Inline graphic that also show significant enrichment and RNA expression are highlighted in yellow, blue, and green respectively. TFs with posterior probability Inline graphic , but the expression level below the 25 percentile are highlighted in gray. TFs with Inline graphic are shown with a dotted pattern. The bottom panel shows active TFs inferred in each cell line (rows) by the algorithm. N1 and SFT show a similar pattern of TF activity. Among the TF regulators identified, HAND2 was shared across 2 cell lines. The protein encoded by this gene belongs to the basic helix-loop-helix family of transcription factors and, among many other development-related functions, is required for vascular development and regulation of angiogenesis, possibly through a VEGF signaling pathway ([165]78). The N1 and SFT1 cell lines shared expression of CEBPD, GTF2F1 and MXI1. CEBPD is an intron-less gene that encodes a bZIP transcription factor which can bind as a homodimer to certain DNA regulatory regions. It can also form heterodimers with the related protein CEBP-alpha. The encoded protein is important in the regulation of genes involved in immune and inflammatory responses and may be involved in the regulation of genes associated with activation and/or differentiation of macrophages. It may also be involved in the early stages of adipogenesis ([166]79,[167]80). GTF2F1 encodes TFIIF, a general transcription initiation factor that binds to RNA polymerase II and helps to recruit it to the initiation complex in collaboration with TFIIB. It is also a JNK1/3-binding partner and may modulate c-JUN-mediated MAPK signaling in cell proliferation, differentiation, migration, senescence and apoptosis ([168]81). MXI1 encodes a basic helix-loop-helix protein that inhibits the transcriptional activity of MYC by sequestering MAX, thus preventing the formation of MYC-MAX heterodimers, and by competing with MYC-MAX heterodimers for binding to target sites ([169]82). The iHPF_A and iHPF_B cell lines shared expression of the TF regulators PPARG and TCF21. PPARG encodes a member of the peroxisome proliferator-activated receptor (PPAR) subfamily of nuclear receptors. PPARs form heterodimers with retinoid X receptors (RXRs) and these heterodimers regulate transcription of various genes that regulate adipocyte differentiation and, pathologically, the development or progression of obesity, diabetes, atherosclerosis and cancer ([170]83,[171]84). TCF21 encodes a transcription factor of the basic helix-loop-helix family. The TCF21 product is mesoderm specific, and expressed in embryonic epicardium, mesenchyme-derived tissues of lung, gut, gonad, and both mesenchymal and glomerular epithelial cells in the kidney. It is involved in the differentiation of mesenchymal cells to fibroblasts ([172]85). Of note, many of these transcriptional regulators are basic helix-loop-helix TFs, and three TF regulators in particular—CEBPD, TCF21 and HAND2 have been identified as promotors of mesenchymal cell differentiation towards the fibroblast lineage (as opposed to the smooth muscle cell lineage). This suggests that the immortalized fibroblast cell lines express TF regulators that function to maintain the fibroblast phenotype as well as those that may extend this phenotype towards that of immune/inflammatory cells (CEBPD), adipocytes (CEBPD, PPARG) or vascular cells (HAND2). This suggests that fibroblast phenotypic plasticity is perhaps a common rather than exceptional cellular state that may be identified by the expression of particular TF regulators. Discussion In this work, we presented an algorithm for inference of TF activity from differential gene expression profiles and causal graphs. The algorithm incorporates transcriptional logic in the context of Bayesian networks, allowing for probabilistic deviation from deterministic logic rules. The probabilistic framework provides the flexibility for ‘plug-and-play’ integration of various logic models. In this study, we focused on one such model (OR-NOR logic). As a future direction we plan to extend the packages so users can choose the logic prior to running the inference. The queries are run on causal graphs of TF–gene interactions. We provide several options for such graphs assembled from small-scale curated databases ([173]86,[174]87), large-scale public databases ([175]88–90), as well as de novo reconstructed graphs from high-throughput experiments ([176]30). We note that the quality and the coverage of the causal graph has a major impact on the ability of regulator activity inference models. Most curated publicly available network of transcriptional regulation with annotation on mode of regulation are small and very limited in their coverage while other higher coverage networks may consist of noisy inferred interactions. Unlike standard enrichment analysis methods, our framework has been designed to account for noise (applicability of interactions and noise in direction of regulation). Bayesian Networks are Directed Acyclic Graphs (DAGs) and as such, feedback loops cannot be directly modeled in this context, which is a limitation of this approach. Another limitation of our approach is that it is designed to detect TF activation, but not TF deactivation. Moreover, since we only consider the OR-NOR transcriptional regulatory logic, results produced by this approach may miss TFs with alternative regulatory relationships. Since the approach is Bayesian and takes the entire topology of the network into account, by design it outputs a minimal number of TFs whose activation can explain the gene expression data. This is an advantage of our algorithm over enrichment analysis methods that typically contain a large proportion of false positives. The disadvantage may be that sometimes not all true positives get high posteriors probabilities, especially if there are many active regulators present. This can be observed in the validation and benchmark subsections, where we used overexpression experiments as input for the inference algorithm (see Table [177]4 and Figure [178]4). In the MYC overexpression data set, MYC itself was recovered as an active regulator. Although this was not the case for other over-expression data sets, the predicted results collectively point to the relevant biology. It should be noted that over-expression experiments may contain many off-target effects that will confound the inference results. Moreover, the network may not sufficiently capture the regulatory interaction in the context of the biological experiment. Indeed, in the case of MYC, we observe that the ‘regulonbrca’ network contains many interactions between MYC and downstream genes that are not present in the three-tissue network. On the other hand, the three-tissue network provides better performance in other experiments. The main difference between the three-tissue and ‘regulonbrca’ networks is that the former attempts to represent a universal non-context specific network, while the latter is a breast cancer-specific network. The ability of regulator inference algorithms to recover upstream regulators depends on multiple factors, including sufficiency of evidence in the input gene expression data, noise, and importantly encapsulation of the regulatory interaction in the network. Since regulatory networks are dynamic and context-dependent, it is unlikely for any universal network or methodology to capture the exact modulators. However, tools such as ours can be tried with multiple networks and the predictions made by the network should point to the relevant biology. We have provided multiple networks for users to try in our web application. Moreover, users can try custom networks within the web app, or by using the provided Python and R packages. It is worth noting that predicted TFs that show no expression in the data (see Figure [179]8), should be considered false positives. If desired, information about the expression of TFs may be used to filter the network prior to the inference analysis, which might reveal an alternative set TFs that explain the input data. Figure 4. [180]Figure 4. [181]Open in a new tab Comparison of active TF predictions by our method (y axis) and VIPER (x axis), in three separate overexpression experiments. Input network used is the BRCA derived regulon from ([182]15). Jointly predicted regulators are colored in red. Top predictions specific to one algorithm are labeled in black. Gray dots show low confidence predictions by both algorithms. For the inference process, we utilized Gibbs Sampling, an MCMC algorithm that is widely used in Bayesian networks. A drawback from MCMC models in Bayesian networks is the convergence time. We implemented the core of the inference in C++ to reduce the wait time. Convergence time mainly depends on the number of interactions in the network used. For the 3-tissue network which has approximately 250 thousand TF–gene interactions, the run time is of the order of 20 minutes. For the used ARACNe network ‘regulonbrca’, which after filtering contains around 74 thousand TF–gene interactions, the run time is about 2 minutes. Other strategies can be taken to speed up processing time. For instance, an enrichment-based test can be run a priori to exclude TFs with insufficient differentially expressed targets, effectively reducing the network size. This will result in a significant speed up in convergence time, albeit some border line cases may be lost. Our tool is an exploratory discovery tool that provides a narrow list of potentially relevant TFs, summarizing the observed differential gene expression data. This is similar to standard GO term and pathway enrichment analysis that are also typically applied to summarize differential gene expression data. The focus of our tool is transcriptional regulation and our algorithm can be used as a complementary tool in conjunction with enrichment analysis methods. To increase the utility of our algorithm, we provide user-friendly R and Python packages as well as a web-based platform with integrated interactive visualization. The pre-processing steps for speeding up the algorithm are implemented as default in the webserver. As databases of causal transcriptional regulatory interactions become more available, we will integrate them in the web-platform and accordingly optimize the inference algorithm for each network. Data and software availability ‘scRNAseq of Primary and Immortalized Human Prostate Fibroblast Cell Lines’ BioProject Accession number: PRJNA881605 Study Accession number: SRP397809 SRA Accession numbers: SRX17617080, SRX17617081, SRX17617082, SRX17617083 Use NCBI’s SRA toolkit to download the 4 datasets above. For further instructions, see: [183]https://www.ncbi.nlm.nih.gov/sra/docs/sradownload We make our inference algorithm available to use through the following web application: [184]https://umbibio.math.umb.edu/nlbayes Open-source R and Python packages are available at Github: [185]https://github.com/umbibio/nlbayes-r (doi:10.5281/zenodo.7105306) [186]https://github.com/umbibio/nlbayes-python (doi:10.5281/zenodo.7105233) We have used R version 4.1.3 and Python 3.10 to develop and test the corresponding packages. Detailed instructions and examples are available in each corresponding repository. All figures presented in this work can be reproduced by following instructions available at the GitHub repository: [187]https://github.com/argearriojas/nlbayes-reproducibility (doi: 10.5281/zenodo.10116763) Corresponding files needed for generating the figures are made available at Zenodo: [188]https://zenodo.org/records/10116664 (doi: 10.5281/zenodo.10116663). Supplementary data [189]Supplementary Data are available at NARGAB Online. Supplementary Material lqad106_supplemental_files [190]Click here for additional data file.^ (1.4MB, zip) Contributor Information Argenis Arriojas, Department of Mathematics, University of Massachusetts Boston, Boston, MA 02125, USA; Department of Physics, University of Massachusetts Boston, Boston, MA 02125, USA; Center for Personalized Cancer Therapy, University of Massachusetts Boston, Boston, MA 02125, USA. Susan Patalano, Center for Personalized Cancer Therapy, University of Massachusetts Boston, Boston, MA 02125, USA. Jill Macoska, Center for Personalized Cancer Therapy, University of Massachusetts Boston, Boston, MA 02125, USA. Kourosh Zarringhalam, Department of Mathematics, University of Massachusetts Boston, Boston, MA 02125, USA; Center for Personalized Cancer Therapy, University of Massachusetts Boston, Boston, MA 02125, USA. Funding This work was supported by grants AI150090 (KZ), R01AI167570 (KZ), DK104310 (JAM), and CA156734 (to J.A.M., K.Z.) from the National Institutes of Health; A.A. was supported in part by College of Science and Mathematics Dean's Doctoral Research Fellowship through fellowship support from Oracle [ID R20000000025727]. The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. Conflict of interest statement. None declared. References