Abstract

   The interactions among the components of a living cell that constitute
   the gene regulatory network (GRN) can be inferred from
   perturbation-based gene expression data. Such networks are useful for
   providing mechanistic insights of a biological system. In order to
   explore the feasibility and quality of GRN inference at a large scale,
   we used the L1000 data where ~1000 genes have been perturbed and their
   expression levels have been quantified in 9 cancer cell lines. We found
   that these datasets have a very low signal-to-noise ratio (SNR) level
   causing them to be too uninformative to infer accurate GRNs. We
   developed a gene reduction pipeline in which we eliminate uninformative
   genes from the system using a selection criterion based on SNR, until
   reaching an informative subset. The results show that our pipeline can
   identify an informative subset in an overall uninformative dataset,
   allowing inference of accurate subset GRNs. The accurate GRNs were
   functionally characterized and potential novel cancer-related
   regulatory interactions were identified.

   Subject terms: Regulatory networks, Cancer, Reverse engineering

Introduction

   Living organisms are orchestrated by the biochemical reactions that
   occur as a result of the interactions between biomolecules. For that
   reason, understanding the biochemical, physiological, and pathological
   processes from a gene regulation perspective is of importance. These
   processes in a living organism on a gene level can be represented via a
   gene regulatory network (GRN) that can be inferred from
   perturbation-based gene expression data, e.g., where each gene in the
   system is knocked down in a separate experiment. Such GRNs are useful
   for providing mechanistic insights of a biological system. Accurately
   performed inference on cancer data might propose alternative treatments
   since novel gene regulatory links carry the potential to identify new
   drug targets.

   Several GRN inference methods have been proposed including least
   squares, LASSO^[36]1,[37]2, Robust Network Inference algorithm
   (RNI)^[38]3, Context Likelihood of Relatedness (CLR)^[39]4,
   Genie3^[40]5, and Inferelator^[41]6. However, the inference of the
   mentioned GRNs from the biological data has severe limitations, among
   which the most crucial one is noise of the dataset which negatively
   affects the performance of any inference method. It has been shown that
   even the best performing inference methods tend to exhibit very poor
   accuracy if the signal-to-noise ratio (SNR) is low^[42]7, and that real
   datasets have low SNR^[43]8. Previous work has been done in order to
   propose solutions for this problem in GRN inference, mostly exploiting
   stability during bootstrapping as a means to reduce the effect of
   noise. For instance, a nested bootstrapping algorithm has been proposed
   to control the false discovery rate (FDR) in GRN inference of noisy
   datasets, in order to improve the accuracy of the applied method^[44]8.
   However, these solutions are applied directly to inferred GRNs of noisy
   datasets, and if the dataset as a whole is not informative then the
   improvement may be limited.

   In this study, in an attempt to overcome the challenge of experimental
   noise, a subset-selection pipeline was developed where the main aim is
   to improve the SNR of the dataset by permanently removing the least
   informative genes and their experiments from the system until reaching
   an informative subset. The novelty of this study lies in the collection
   of the most informative genes in a dataset before the GRN inference is
   performed, and reaching a subset allowing the inference of the accurate
   GRNs.

Results

   Previous studies have shown that accurate GRN inference relies mainly
   on the SNR of the datasets^[45]7–[46]10. For this reason we developed a
   subset-selection pipeline to improve the SNR level of the system.

   In order to identify the genes lowering the system’s SNR level, the
   pipeline temporarily removes each gene from the dataset, and the SNR of
   the remaining subset is measured followed by the return of the removed
   gene into the dataset. After this is done for all genes, the gene whose
   removal caused the largest SNR gain is permanently removed from the
   system. This procedure is repeated for each reduced subset until only
   two genes remain (Fig. [47]1a, b).

Fig. 1. Workflow of the subset-selection algorithm.

   [48]Fig. 1
   [49]Open in a new tab

   a The subset-selection algorithm, where each gene is removed together
   with its knockdown experiments, and SNR of the remaining dataset is
   measured. The gene is then put back and the procedure is repeated for
   all genes in the dataset. b The inner part of the algorithm showing the
   changes in SNR after each removal and the detection of the gene whose
   removal increases SNR the most and therefore will be permanently
   removed. c The simulation step applied for the calculation of the
   expected accuracy of the GRN inference, where A[true] refers to the
   true GRN that can either be fully synthetic or estimated from the real
   data, Y[sim] denotes the generated expression matrix from A[true],
   A[inferred] is the inferred GRN, while the accuracy was evaluated in
   terms of the area under the ROC and precision-recall curves.

   The reduction algorithm was applied to data from generated true GRNs of
   size 250–1000 genes from GeneSPIDER^[50]7, and a 200-gene GRN and
   dataset from GeneNetWeaver (GNW)^[51]11 for assessing the performance
   of the pipeline. The genes were also removed randomly for comparison.
   Figure [52]2 exhibits the performances of the reduction algorithm and
   of random removal in terms of the area under the receiver operating
   characteristic and precision-recall curves (AUROC and AUPR,
   respectively) for the size of 750 genes. Accuracy results from other
   sizes are displayed in Supplementary Figs. [53]1–[54]3. Figure [55]3
   shows the performance of the reduction algorithm on the GNW data in
   comparison with random removal. Each AUROC and AUPR point in Figs [56]2
   and [57]3 were obtained by GRN inference with least squares with
   cut-off (LSCO)^[58]12 from the reduced dataset, as described in the
   “GRN inference methods” subheading in the Methods section.

Fig. 2. Performance of the subset-selection algorithm on the 750-gene
GeneSPIDER synthetic dataset.

   Fig. 2
   [59]Open in a new tab

   The performance in terms of AUROC and AUPR of the subset-selection
   algorithm is shown, compared to the performance of random gene removal.
   The x-axis represents the remaining subset size.

Fig. 3. Performance of the gene reduction algorithm on the 200-gene
GeneNetWeaver synthetic dataset.

   Fig. 3
   [60]Open in a new tab

   The performance in terms of AUROC and AUPR of the subset-selection
   algorithm is shown, compared to the performance of random gene removal.
   The x-axis represents the remaining subset size.

   The reduction algorithm improves accuracy and SNR considerably better
   than random selection. The reason that random subset selection has a
   positive effect on these measures is that smaller subsets have fewer
   degrees of freedom and are therefore easier to predict and have higher
   SNR. To assess the statistical significance of the difference between
   the performances of the reduction algorithm and random removal, the
   paired Wilcoxon test was applied to the SNR, AUROC, and AUPR levels
   obtained from both removal strategies. The use of a nonparametric test
   was due to not meeting the normality assumption in any of the variables
   according to the Shapiro–Wilk test. For all measures from all dataset
   sizes except one (AUPR of 250 genes), the reduction algorithm was
   statistically significantly different (p < 3.1e−05) from random removal
   (Supplementary Table [61]1).

   When starting with 750 genes in the GeneSPIDER data, AUROC reached a
   plateau at almost the perfect level and AUPR made a sharp increase when
   ~200 genes remained in the dataset (Fig. [62]2). 500- and 1000-gene
   datasets followed a similar trend for AUROC in terms of reaching a
   plateau, and a constant increase until the very end was observed for
   the AUPR values with the 500-gene dataset performing considerably
   better than the 100-gene dataset (Supplementary Figs. [63]2 and [64]3).
   On the other hand, for the 250-gene dataset a constant increase in both
   AUROC and AUPR values was observed, which is probably due to the small
   starting size (Supplementary Fig. [65]1).

   For the 200-gene GNW data, a marked increase in both AUROC and AUPR
   values occurred when ~65 genes remained, and the AUROC level reached
   almost 1 when ~30 genes remained (Fig. [66]3).

   We also investigated a simpler approach of selecting genes based on
   their entropy (Supplementary Figs [67]51–[68]54). For all starting
   sizes, this method did not outperform random selection in terms of
   AUROC. For starting sizes 500 and 750 genes it also did not improve
   compared to random selection in AUPR, but for 250 and 1000 genes some
   improvement was observed, yet at modest AUPR levels. In conclusion, the
   entropy-based subset-selection method generally performed poorly and
   cannot be recommended.

   In the analyses and inferences of the selected L1000 datasets, the
   reduction pipeline was run on the datasets and the SNR level of subsets
   of each cell line was measured. Simulation and benchmarking of expected
   inference accuracy at different SNR levels using in silico data were
   performed. Finally, GRNs were inferred from the most informative subset
   of each cell line and functional analysis and classification were made
   for these accurate subset GRNs.

Benchmarking the reduction pipeline

   The subset-selection pipeline was applied to all L1000 cell lines. It
   progressively removes the gene whose exclusion improves SNR the most,
   continually increasing the SNR of the dataset. In Fig. [69]4, the
   gradual increase of the SNR levels reflecting the informativeness of
   the remaining subsets is shown.

Fig. 4. The SNR level of SNR-enriched subsets of the nine selected L1000 cell
lines.

   Fig. 4
   [70]Open in a new tab

   The x-axis represents the subset size, the y-axis denotes the SNR, and
   each curve represents a cell line.

   In order to determine the most suitable inference method for network
   generation and GRN inference, both LSCO and LASSO were benchmarked on
   different subset sizes and SNR levels. The results from the A375 cell
   line are shown in Fig. [71]5, and from all cell lines in Supplementary
   Figs [72]22–[73]39.

Fig. 5. Evaluation of methods for benchmarking GRN inference accuracy with
simulation.

   Fig. 5
   [74]Open in a new tab

   The evaluation was made for the subsets of the A375 cell line in terms
   of AUROC (a) and AUPR (b). The legend shows the algorithm used to
   generate the true GRN for benchmarking, and the algorithm used to infer
   the GRNs from the simulated data as the first and second labels. For
   example, ‘LSCO & LASSO’ denotes that the true GRN was generated with
   LSCO and the GRNs were inferred with LASSO.

   It can be observed from the benchmarking of the two inference methods
   on the subsets of A375 cell line (Fig. [75]5) that a radical increase
   in both AUROC and AUPR levels starts at the 50-gene subset, and that
   LSCO generally outperforms LASSO at both true GRN generation and GRN
   inference. Therefore, the LSCO method was chosen for both the true GRN
   generation and the GRN inference for its high accuracy and
   computational efficiency.

   The accuracy for subsets using LSCO sharply improves at around 50 genes
   (Fig. [76]6). For this reason, the inference of the “accurate GRNs” was
   performed on the 50-gene subsets of all cell lines.

Fig. 6. Expected accuracy of subset GRNs.

   Fig. 6
   [77]Open in a new tab

   The accuracy was derived from simulations using LSCO and measured as
   AUROC (a) and AUPR (b) on the nine L1000 cell lines. The x-axis
   represents the subset size, the y-axis the AUROC and AUPR values, and
   each curve represents a cell line.

Inference of accurate subset GRNs from L1000 subsets

   After establishing that the 50 most informative genes result in
   accurate GRNs, we next inferred such GRNs by applying LSCO to these
   subsets in each of nine L1000 cell lines. An average sparsity of about
   3–5 links per gene is considered biologically
   plausible^[78]3,[79]13,[80]14. In order to get close to this, we
   selected GRNs with a sparsity ranging from 2 to 5 links per gene. Among
   these, the one whose number of links matches best with the number from
   the simulation resulting in the most optimal accuracy was selected, and
   called “the accurate GRN”. The accurate subset GRN of the HT29 cell
   line is shown in Fig. [81]7, and the accurate subset GRNs of all nine
   L1000 cancer cell lines can be found in Supplementary Figs
   [82]40–[83]48.

Fig. 7. The accurate GRN of the HT29 colon cancer cell line.

   Fig. 7
   [84]Open in a new tab

   The nodes demonstrate the informative genes of this L1000 cell line,
   and blue and red edges represent the regulatory interactions which are
   either activation or suppression, respectively.

   The SDHB gene is a tumor suppressor gene, and has previously been shown
   to be downregulated in colorectal cancer^[85]15. However, to our
   knowledge, possible activators for treatment have not yet been
   proposed. In our inferred subset GRN for the HT29 cell line (Fig.
   [86]7), several genes including DNAJB12, TOMM22, TMEM97, S100A4,
   RABGGTA, and BLCAP suppress SDHB, suggesting possible mechanisms of
   SDHB suppression. A database search of these genes was performed in the
   TRRUST^[87]16 and RegNetwork^[88]17 databases but none of these genes
   was found to regulate SDHB. We also searched the network databases
   FunCoup^[89]18 (links with >0.80 confidence), Pathway Commons^[90]19,
   and GeneMania^[91]20 and found support for a link between SDHB and
   TOMM22 in both FunCoup (confidence 0.997) and GeneMania, and between
   SDHB and DNAJB12 in GeneMania. We therefore propose that the inferred
   regulators of SDHB could be valid potential regulatory targets for
   colorectal cancer treatment.

   The inferred subset GRN from the PC3 prostate cancer cell line
   (Supplementary Fig. [92]47) suggests several genes as suppressors of
   SDHB, such as PNKB, PWP1, PNP, HADH, HTATSF1, and BAZ1B, among which
   PWP1, HTATSF1, and BAZ1B are transcription regulators. BAZB1 is a
   kinase, and HTATSF1 has RNA-binding activity. SRRM1 activates SDHB in
   the subset GRN. Again, these relations are not present in the TRRUST
   and RegNetwork databases. We, however, found that SDHB is linked to PNP
   and HADH in the FunCoup (confidence = 0.827 and 0.999) and GeneMania
   networks. The relation between SDHB and HADH also appears in Pathway
   Commons, and GeneMania links SDHB to SRRM1. These network and pathway
   connections provide support for the validity of these novel regulatory
   mechanisms for prostate cancer treatment.

   The inferred subset GRN from the MCF7 breast cancer cell line
   (Supplementary Fig. [93]46) highlights that the GM2A and PRKACA genes
   are activated and suppressed by several genes. Overexpression of GM2A
   in breast cancer has previously been shown^[94]21, where one of the
   investigated cell lines was MCF7. In the inferred GRN, GM2A is
   activated by NOL3, AKT1, KIAA0196, and ACOT9, and suppressed by LIPA,
   PWP1, JMJD6, and CCDC86, among which PWP1 and JMJD6 are transcription
   regulators. The JMJD6 protein has RNA binding and histone demethylase
   activity. Although these interactions with GM2A are not present in the
   TRRUST and RegNetwork databases, indirect evidence of their validity
   was found in RegNetwork where GM2A and its identified regulator NOL3
   are regulated by two third-party genes, TFAP2A (q = 0.0007) that is a
   DNA-binding transcription regulator and an enzyme, and JUN
   (q = 0.0006), a DNA-binding transcription regulator and oncogene. The
   likelihood of finding a common regulator by chance was calculated as
   the square of the total number of targets of each common regulator
   divided by the total number of protein-coding genes, followed by
   Bonferroni correction of these probabilities and excluding cases above
   0.05. In addition to this, the presence of a link between GM2A and LIPA
   was found in GeneMania. This adds support to the validity of our
   identified interactions as potential mechanisms in breast cancer.

   Another example is the PRKACA gene, an oncogene that has been linked to
   breast cancer^[95]22. The inferred GRN indicates that it is suppressed
   by NUP85 and IST1H2B, and activated by CHAC1, PWP1, PP2R5E, NOL3, and
   PNP, of which PWP1, as stated above, is a transcription regulator.
   Searching the TRRUST and RegNetwork databases did not verify any of our
   predicted PRKACA interactions, yet the connection between PRKACA and
   NOL3 was indirectly supported in RegNetwork by a third gene, E2F1, a
   DNA-binding transcription regulator, that regulates both (q = 0.02).
   Further evidence was found in that FunCoup (confidence 0.991), Pathway
   Commons, and GeneMania contain a link between PRKACA and NUP85. These
   observations support the validity of our proposed regulatory mechanisms
   for breast cancer treatment.

Functional analysis of the accurate subset GRNs

   In order to characterize and validate the accurate subset GRNs, we
   analyzed them for pathway enrichment, overlap with a functional
   association network database, and protein class.

   The pathway enrichment analysis in this study was performed via PathwAX
   wherein the genes of the accurate subnetworks were tested for
   significant association to KEGG pathways^[96]23 with network crosstalk
   enrichment analysis. The results are in Supplementary Table [97]6. We
   note that cancer-related pathways such as ‘cell cycle’, ‘Oxidative
   phosphorylation’, and ‘Protein processing’ were significantly enriched
   in several of the cell lines. Among other recurring pathways we note
   ‘Alzheimer’s disease’, ‘Parkinson’s disease’, and ‘Huntington’s
   disease’, which have been linked mechanistically to
   cancer^[98]24–[99]26.

   The pathway ‘Peroxisome’ was significantly enriched for the subset GRN
   of the prostate cancer cell line PC3. The peroxisome has previously
   been linked to prostate cancer^[100]27–[101]30, and PC3 was one of the
   investigated cell lines in the study by Mueller et al.^[102]27. The
   subset is thus highly relevant for the connection between prostate
   cancer and the peroxisome. CAT, a member of Peroxisome pathway in the
   antioxidant system, is in the subset GRN activated by CEBPZ, TMCO1,
   CLASRP, and PPP2R3A. Of these, CEBPZ is a transcriptional regulator,
   yet its regulatory effect on the expression of CAT is not known.

   The inferred GRNs were further compared to the FunCoup functional
   association network database^[103]18 (Table [104]1). In general,
   relatively few of the inferred regulatory links were found in FunCoup,
   which is perhaps not surprising given that FunCoup contains undirected
   functional links that do not represent regulatory interactions.
   However, the overlap was significant (p < 0.05) in five cell lines with
   a hypergeometric test. One of these, MCF7, matched no significant
   pathways above, and PC3 only one pathway, indicating that these GRNs
   may represent unknown mechanisms. In order to investigate whether the
   significant match was due to the success of the subset-selection
   algorithm identifying an informative subset for accurate GRN inference
   or not, we also inferred GRNs from the full-size datasets of the L1000
   cell lines and compared the inferred GRNs with a sparsity of ~3 links
   per node on average to the related FunCoup networks and found that no
   overlap was significant (Supplementary Table [105]5).

Table 1.

   The comparison of the accurate subset GRN of each cell line to the
   related FunCoup network of the same genes.
   Cell line Overlap Links in FunCoup Links in the inferred GRN p-Value
   A375      6       36               84                        0.001*
   A549      4       106              130                       0.83
   HA1E      6       90               96                        0.14
   HCC515    6       74               104                       0.09
   HEPG2     7       33               221                       0.02*
   HT29      13      97               197                       0.04*
   MCF7      19      79               125                       5e−09*
   PC3       10      77               91                        0.0004*
   VCAP      10      88               170                       0.08
   [106]Open in a new tab

   * Marks statistical significance with p < 0.05.

   Protein class analyses were performed for the genes of the selected
   subsets in terms of the activity of their proteins. In order to achieve
   this, the human proteome was downloaded from UniProt^[107]31, including
   the keywords transcription regulation, transmembrane, DNA binding,
   kinase, metabolism, RNA-binding, enzyme (from molecular function), and
   signaling. Additional data were downloaded from COSMIC^[108]32 from
   which oncogenes and tumor suppressor genes were obtained. The genes
   belonging to each selected cell line were assigned a protein class
   according to this scheme (Fig. [109]8).

Fig. 8. Protein class enrichment of each cell line and their 50-gene subsets.

   Fig. 8
   [110]Open in a new tab

   The x-axis shows the cell line, and the y-axis shows the class
   enrichment relative to UniProt fractions as log(ratio).

   Most of these classes tend to be enriched in L1000 gene sets relative
   to UniProt, particularly kinase, metabolism, enzyme, oncogene, and
   tumor suppressor gene. Mainly transmembrane and DNA binding tend to be
   depleted. The highest enrichments in subsets relative to their original
   L1000 set were observed for A375 (kinase, enzyme, RNA binding), MCF7
   (signaling, RNA binding, metabolism), PC3 (transcription regulation,
   signaling, RNA binding), and HCC515 (transcription regulation). These
   cell lines thus underwent a further enrichment of regulatory functions
   during the subset selection. A few cases of further depletion were also
   observed, for instance of DNA binding in A375 and A549.

Knockdown effect on the target

   Since a successful knockdown of the target gene is important, the
   knockdown effects on the targets were investigated and are shown in
   Fig. [111]9. Volcano plots of the knockdown effects are provided in the
   supplementary material (Supplementary Figs. [112]4–[113]21).
   Significance analyses of the targeted genes show that a majority of the
   targets were successfully downregulated. However, in addition to the
   nonsignificant changes in their expressions, a portion of the targets
   was observed to be significantly upregulated. In A375, A549, HCC515,
   HEPG2, and HT29, 6–10% of the genes were significantly upregulated,
   while this proportion increases up to 20–26% in HA1E, MCF7, PC3, and
   VCAP cell lines.

Fig. 9. Knockdown effect on target.

   Fig. 9
   [114]Open in a new tab

   Significant and nonsignificant up- and downregulation of the shRNA
   target genes in the studied L1000 cell lines.

Discussion

   Uninformative data have previously been shown to negatively affect the
   performance of GRN inference methods, leading to poor accuracy. In
   order to overcome this problem, a progressive subset-selection method
   is proposed, removing the genes that are the most affected by noise.
   This approach proved to capture the most informative and therefore the
   most accurately inferrable subsets, which can be considered a major
   advance in the GRN inference area since the high noise problem is a
   general issue for all real datasets.

   The subset-selection algorithm was validated by synthetic data and true
   GRNs. The performance of the algorithm was shown to significantly
   outperform random removal of genes. We applied the algorithm to L1000
   datasets of nine selected cancer cell lines and observed that we could
   increase SNR to a level that permits accurate GRN inference.

   We examined whether the experimental L1000 perturbations were
   successful via nonparametric significance analysis of the targeted
   genes’ expression and visualized them by volcano plots. A majority of
   the target genes were significantly knocked down compared to their
   controls. However, also many significantly upregulated genes were
   observed, as well as nonsignificantly up- and downregulated ones. One
   possible reason for this from the biological perspective is that these
   genes are highly important for the system, and their expression is
   immediately restored by feedback mechanisms in the cell. If the
   reaction is very strong, this could lead to temporal overcompensation
   and an observed increase in expression. This could be revealed by a
   time series of measurements, but unfortunately the L1000 shRNA
   perturbations are limited to the 96 h time point.

   In the benchmarking of the real data, using in silico datasets where
   the properties of the real datasets are mimicked, a significant
   improvement in LSCO’s inference accuracy in terms of AUROC and AUPR was
   observed when increasing the SNR level of the dataset as the subset
   becomes smaller. The improvement was stronger and came earlier for LSCO
   compared to LASSO, especially when used for inferring a true GRN for
   simulations. Given this variability, it is possible that combinations
   of other inference algorithms might perform even better.

   The benchmarking of different subset sizes in different cell lines with
   LSCO indicated that in order to infer accurate GRNs, the minimum SNR
   level of the dataset should be at least ~0.05. For the L1000 datasets,
   this means losing more than 90% of the initial data with the proposed
   subset-selection algorithm. However, in a situation where the majority
   of the genes have noisy measurements, it becomes unbeneficial to
   include them in the system. In GRN inference where the aim is to
   reliably discover regulatory interactions, including noisy data that
   lowers the informativeness of the dataset is unacceptable. On the other
   hand, the possible removal of important genes such as oncogenes may
   obscure their potential interactions with other genes, and cause their
   effect in cancer to remain unknown. In an attempt to minimize this
   risk, such genes can be kept. However, then more genes must be
   sacrificed in order to achieve the desired SNR for the inference to be
   accurate. This would result in a smaller subset, which may be a greater
   drawback.

   The time complexity of the subset-selection algorithm is in principle
   O(n^2) for n genes. However, due to the currently used SNR calculation
   at each step this is further increased to O(n^3) because it relies on
   singular value decomposition. This time complexity means that while it
   is fast up to a few hundred genes, for datasets of around 1000 genes
   several days of computation might be needed to complete the subset
   selection.

   One result of this study is accurate subset GRNs for nine cancer cell
   lines. These include a large number of new regulatory interactions
   among transcriptional regulators and oncogenes, and suggest new
   potential therapeutic targets. We have focussed on a few cases of
   cancer relevance that would deserve experimental follow-up and
   validation.

Methods

Calculation of the SNR

   The smallest value from the singular value decomposition of the gene
   expression matrix is considered to be the signal, and the variance
   multiplied by a degrees-of-freedom-dependent chi-square constant is
   considered to be the noise in this study. The ratio between the signal
   and noise is shown in Eq. ([115]1).
   [MATH: <mrow><mi
   mathvariant="normal">SNR</mi><mo>≡</mo><mfrac><mrow><msub><mrow><mi>σ</
   mi></mrow><mrow><mi>N</mi></mrow></msub><mfenced close=")"
   open="("><mrow><mi>Y</mi></mrow></mfenced></mrow><mrow><msqrt><mrow><ms
   up><mrow><mi>χ</mi></mrow><mrow><mo>−</mo><mn>2</mn></mrow></msup><mfen
   ced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mi>α</mi><mo>,</mo><mi>N</mi><mi>M<
   /mi></mrow></mfenced><mi>λ</mi></mrow></msqrt></mrow></mfrac></mrow>
   :MATH]
   1

   where σ[N](Y) denotes the smallest value from the singular values of
   the expression matrix Y,
   [MATH:
   <msqrt><mrow><msup><mrow><mi>χ</mi></mrow><mrow><mo>−</mo><mn>2</mn></m
   row></msup><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mi>α</mi><mo>,</mo><mi>N</mi><mi>M<
   /mi></mrow></mfenced></mrow></msqrt> :MATH]
   the inverse chi-square distribution with 1 − α confidence level and NM
   degrees of freedom (N genes, M samples), and λ the variance of the
   noise.

Datasets

   The pipeline was applied to selected datasets from the LINCS L1000
   data^[116]33. However, for the calculation of the expected accuracy of
   the inference and the validation of the existing links in the
   corresponding GRN, in silico networks and perturbation datasets with
   properties similar to the real datasets were generated ranging from the
   full size of the dataset to smaller but more informative subset sizes.
   Supplementary Figs [117]22–[118]39 show all the benchmark results using
   either LSCO or LASSO for inference of a true GRN to generate simulated
   data as well as inference of GRNs from simulated data, in terms of
   AUROC and AUPR measurements for the subsets of the nine L1000 cell
   lines.

Synthetic GRNs and datasets

   In this study, we used two different in silico GRN and data generation
   tools, namely GeneSPIDER^[119]7, and GeneNetWeaver^[120]11 (GNW). In
   the synthetic data, the most important property is considered to be the
   SNR due to the basis of the proposed subset-selection algorithm. For
   this reason, we set the SNR levels of the GeneSPIDER datasets to 0.001
   in accordance with the SNR levels of the L1000 datasets (Table [121]2).
   For the GeneSPIDER datasets, this was possible as the tool itself
   allows this property to be set by the user. However, in GNW, this was
   not an option. On the other hand, by using their default settings for
   the noise parameters (0.05 coefficient for stochastic, and 0.025
   standard deviation for the Gaussian noise), we were able to obtain a
   final dataset with 0.0079 SNR, which is reasonably close to the SNR
   levels of the L1000 and GeneSPIDER datasets.

Table 2.

   The main properties of the L1000 datasets that are used in the
   subset-selection pipeline.
   Cell line      Tissue         Size     SNR
   A375      Skin             649 × 1879 0.0020
   A549      Lung             744 × 2016 0.0017
   HA1E      Embryonic kidney 716 × 2092 0.0017
   HCC515    Lung             803 × 2390 0.0010
   HEPG2     Liver            650 × 2129 0.0030
   HT29      Colon            746 × 2901 0.0012
   MFC7      Breast           512 × 1282 0.0030
   PC3       Prostate         697 × 1670 0.0023
   VCAP      Prostate         664 × 1717 0.0021
   [122]Open in a new tab

   The sizes of the datasets are represented by (genes × experiments).

   In order to explore the effectiveness of the subset-selection algorithm
   for various SNR values, we also generated two different datasets from
   the 250-gene GeneSPIDER network with SNR of 0.01 and 0.0001 which are
   10-times higher and lower, respectively, than the biologically
   realistic SNR values (~10^−3) (Table [123]2). The accuracy of the
   subset selection on these datasets can be found in Supplementary Figs
   [124]49–[125]50.

Generation via GeneSPIDER

   To assess the performance accuracy of the algorithm, synthetic GRNs in
   various sizes (250, 500, 750, and 1000 gene), and datasets from these
   GRNs with three replicates per gene and SNR of 0.001 were generated
   with the GeneSPIDER Network.m and Dataset.m tools. The true GRNs were
   generated in scale-free topology allowing three links per gene on
   average. Random Gaussian noise was generated with a standard deviation
   calculated specifically to meet the requirement for the desired SNR
   level, and then added to the noise-free gene expression matrix
   generated from the true GRN (Eqs. [126]1–[127]2).

   For real data, “true GRNs” were generated by first inferring GRNs at
   several sparsities with the least-squares cut-off (LSCO) method, and
   then selecting the most sparse GRN with at least 2 links per node on
   average. The preference of this method was made based on its
   computational efficiency on large scale datasets and competitive
   accuracy shown in the small benchmark whose results are provided in
   Fig. [128]5 for A375 cell line and in the supplementary material for
   all nine cell lines (Supplementary Figs. [129]22–[130]39). Second,
   these inferred networks were used as the true GRNs for the generation
   of the in silico datasets, which was performed by involving the
   perturbation design matrices (P) of the real datasets in order to
   directly reflect the number of replicates into the simulation, as well
   as adding random noise calculated based on the SNR levels of each
   dataset and their subsets.

   The linear model for the expression matrix generation is shown in Eq.
   ([131]2).
   [MATH: <mrow><msub><mrow><mi>Y</mi></mrow><mrow><mi
   mathvariant="normal">Generated</mi></mrow></msub><mo>=</mo><mo>−</mo><m
   sup><mrow><mfenced close=")"
   open="("><mrow><msub><mrow><mi>A</mi></mrow><mrow><mi
   mathvariant="normal">True</mi></mrow></msub></mrow></mfenced></mrow><mr
   ow><mo>−</mo><mn>1</mn></mrow></msup><mo>×</mo><msub><mrow><mi>P</mi></
   mrow><mrow><mi
   mathvariant="normal">Real</mi></mrow></msub><mo>+</mo><msub><mrow><mi>E
   </mi></mrow><mrow><mi
   mathvariant="normal">Generated</mi></mrow></msub></mrow> :MATH]
   2

   In Eq. [132](2), A[True] is the network matrix which is inferred from
   the real dataset with the LSCO method and then stabilized. P[Real]
   indicates the original perturbation matrix of the real dataset or its
   subset, E[Generated] the random noise matrix generated based on the SNR
   level of the real dataset that the simulations are performed for, and
   Y[Generated] the generated expression matrix. Y[Generated] and P[Real]
   are used to infer a GRN, which is then compared to A[True] for
   calculating the expected accuracy of the inference of the real dataset.
   The simulation procedure is visualized in Fig. [133]1c.

Generation via GeneNetWeaver

   To assess the performance of the proposed subset-selection pipeline on
   another synthetic data coming from a different tool, an additional true
   GRN of 200 genes was extracted from the E. coli network that is
   available in GNW, and three datasets from this “true” GRN were
   generated that were afterward used as the biological replicates in one
   merged dataset of size 200-by-600. The 200-gene network was extracted
   from the full-size E. coli network by requesting at least 100
   regulators from random vertices, and neighbor selection was made by the
   GNW setting “random among top 50%”. A kinetic model was then generated
   by keeping all autoregulators, which was used to generate datasets.
   Three knockdown datasets from this kinetic model were generated from a
   combination of ordinary differential equations and stochastic models.
   We avoided the normalization step at this moment, and saved it for the
   merged dataset. We further calculated the log2 fold changes for each
   dataset as the logarithm base 2 of the ratio between each gene’s
   expression and the control. Some of the control values happened to be
   zeros, which would cause infinite values for the fold changes as matlab
   assigns these for the division by zero. To avoid infinite values in the
   fold changes, we added an arbitrary value of 0.001 to all controls. We
   then set the infinite log2 fold change values to zeros as matlab
   assigns infinite values to log[2](0). After this was done for all three
   datasets, we merged them into one single data matrix, and then
   normalized each gene over its three replicates by subtracting the mean
   and dividing by the standard deviation.

Reduction of true GRN during subset selection

   For synthetic data, the true GRN of the dataset is reduced iteratively
   in accordance with the removal of the same gene from the dataset.
   However, due to the lack of a true GRN in the case of real data, a
   different approach is applied here. At regular subset intervals
   (N = 50), a simulation step is included in this pipeline which
   generates a subset GRN and data with the same SNR as the real data
   subset, which is used to estimate the expected accuracy of a GRN
   inferred for the subset (Fig. [134]1c).

L1000 datasets

   L1000 is a collection of perturbation datasets produced by the namesake
   platform, including shRNA (short-hairpin RNA), small molecule and
   protein perturbations performed on 77 cancer cell lines, gathered from
   the measurements on various time points such as 6, 12, 24, 72, 96, 120,
   and 144 h. The Q2Norm Level 3 L1000 data were downloaded from
   [135]https://www.ncbi.nlm.nih.gov/geo/ with the accession [136]GSE92742
   including all the mentioned types of perturbations, time points,
   biological and technical replicates of the measured and inferred genes.
   Level 3 was used because it is the only provided normalized data that
   does not collapse replicates. Only directly measured expression levels
   were used, not inferred ones. In this study, only the shRNA
   perturbations collected at the 96 h time point in eight cell lines
   (A375, A549, HA1E, HCC515, HEPG2, HT29, MCF7, and PC3) and 120 h of one
   cell line (VCAP) were considered for inference. The inconsistent number
   of biological replicates per gene is not considered a problem, hence
   all biological replicates were kept. In each biological replicate,
   usually multiple shRNAs are used that target different parts of the
   same gene in order to mitigate off-target effects. However, sometimes
   different biological replicates for the same target gene use different
   sets of shRNAs. To avoid biases we only kept those common among all
   biological replicates of one target gene.

   Fold changes were calculated for GRN inference using the included empty
   vector controls per cell line. Table [137]2 describes the dimensions in
   each of the nine selected cell lines (genes × experiments) as well as
   their SNR level.

Performance evaluation

   The performance of the subset-selection pipeline was calculated from
   synthetic true GRNs and datasets. The pipeline was applied by removing
   the genes not only from the dataset but also from the true GRN to
   calculate the accuracy of inferred GRNs after every removal and for
   further assessment. As a baseline for comparison, genes were also
   removed randomly from the true GRNs and datasets. This enabled the
   calculation of the statistical significance of the proposed pipeline,
   which are shown in Supplementary Table [138]1. To compare SNR-based
   subset selection to a procedure based on variation, we also applied
   entropy-based removal on the synthetic data where the gene entropy was
   calculated as in Zambelli et al.^[139]34 (Supplementary Figs.
   [140]51–[141]54). The limitation of not having known GRNs for the L1000
   datasets is overcome by means of synthetic data generated from the
   “true GRNs” that are generated as described above. Inferred GRNs based
   on these in silico datasets can then be benchmarked against the “true
   GRNs”. GRN inference can be performed via a variety of
   perturbation-based inference methods such as LSCO or LASSO. Both of
   these methods were applied to each subset.

   The decision of the best inference method in each case was followed by
   the detection of the optimal GRN sparsity providing the best accuracy
   among all the inferred ones ranging from full to empty. In this case,
   rather than the application of a certain model selection criterion such
   as AIC^[142]35 or BIC^[143]36, a manual selection was made based on the
   consideration of a biologically meaningful sparsity level (allowing ~3
   links per node)^[144]12,[145]13 and multiple accuracy measures
   including true positive rate (TPR, or recall, or sensitivity),
   false-positive rate (FPR) and precision (in terms of the area under the
   ROC and PR curves). The reason for selecting a manual sparsity
   optimization over a criterion taking its place in the literature is due
   to the fact that all of these model selection criteria point out the
   best model among all the others. However, it does not guarantee the
   best model in general. On the contrary, the manual selection in this
   study with an arbitrary approximation of the outgoing links set to
   three per gene enables multiple accuracy measures to be taken into
   account together in the selection of the desired GRN.

GRN inference from the real datasets

   After the detection of the best performing method for each subset of
   each cell line as well as the GRN with the optimal sparsity level was
   made, real GRNs from the subsets of the L1000 cell lines could be
   inferred via the best performing inference methods with an expected
   level of accuracy calculated in the simulations. This was followed by
   the identification of the desired GRNs that the simulations suggest.

   Once having the “accurate GRN”, the inferred links and their
   corresponding genes should be evaluated. For this reason, a database
   search was performed in PathwaX^[146]37 for pathway enrichment
   analysis.

   In addition to this, functional class analysis was performed in terms
   of the proteins coded by the genes of interest. The outputs are
   presented and discussed in the “Results” and “Discussion” sections,
   respectively.

GRN inference methods

   In this study we used two GRN inference approaches, least-squares
   cut-off (LSCO)^[147]12 and LASSO^[148]1,[149]2, using existing wrappers
   implemented in the GeneSPIDER Matlab toolbox. The Glmnet implementation
   of LASSO was used, which applies an L1-regularized penalty for the
   sparsity of the inferred GRN. For LSCO, the wrapper implements a
   sparsity cut-off for the inferred least-squares regression coefficients
   in order to achieve a set of GRNs whose sparsities range from full to
   empty. Using both methods, 50 GRNs of different sparsity were inferred
   for each dataset and compared to the true synthetic GRNs to obtain the
   areas under the ROC and precision-recall curves. Inferred GRNs having
   ~2–5 links per gene on average were mainly considered for biological
   database validation.

Supplementary information

   [150]41540_2020_154_MOESM1_ESM.pdf^ (2.2MB, pdf)

   Supplementary Material for “Uncovering cancer gene regulation by
   accurate regulatory network inference from uninformative data”

Acknowledgements