Graphical abstract
   graphic file with name fx1.jpg
   [31]Open in a new tab
Highlights
     * •
       SCING infers gene regulatory networks (GRNs) from scRNA-seq or
       spatial transcriptomics
     * •
       Perturb-seq and other benchmarking measures highlight SCING GRN
       accuracy
     * •
       Application of SCING to Mouse Cell Atlas produced GRNs for 106 cell
       types
     * •
       Application of SCING yielded spatial and cell type mechanisms of
       Alzheimer’s disease
     __________________________________________________________________
   Biocomputational method; Transcriptomics; Machine learning
Introduction
   Understanding pathophysiology is necessary for the diagnosis and
   treatment of complex diseases, which involve the perturbation of
   hundreds or thousands of genes.[32]^1^,[33]^2^,[34]^3^,[35]^4
   Identifying perturbed gene pathways and key drivers of complex diseases
   requires the elucidation of gene regulatory networks (GRN) from high
   dimensional omics data.[36]^5^,[37]^6 Previous approaches have been
   developed and applied to identify these GRNs through bulk
   transcriptomic data and to determine causal mechanisms of
   disease.[38]^7^,[39]^8^,[40]^9 More recently, with the advent of single
   cell RNA sequencing (scRNA-seq) and spatial transcriptomics, the
   contributions of numerous genes in individual cell types have been
   implicated in diseases across many disciplines of biology and
   medicine.[41]^10^,[42]^11^,[43]^12
   GRN construction from scRNA-seq data has been tackled with limited
   success.[44]^13^,[45]^14^,[46]^15 Existing GRN tools utilize scRNA-seq
   data with thousands of pre-select genes and cells, because GRNs from
   full transcriptomes in large scRNA-seq datasets are often
   computationally intensive and intractable.[47]^13^,[48]^15 In addition,
   benchmarking studies have shown limited accuracy of existing methods on
   both synthetic and real data.[49]^13 The basis of poor performance lies
   in the technical variability in scRNA-seq data, namely high gene
   sparsity and cell-to-cell heterogeneity, which bulk GRN methods are not
   optimized to mitigate and single cell GRN methods are designed to
   overcome. Top-performing single cell GRN methods based on a recent
   benchmark study like ppcor,[50]^16 PIDC,[51]^17 and GRNBOOST2,[52]^18
   present diverse approaches to GRN construction and unique advantages
   and limitations that collectively describe the current state of single
   cell GRN methods. Although tools such as ppcor[53]^16 and PIDC[54]^17
   use partial correlation and partial information decomposition,
   respectively, to identify gene co-expression modules, few methods are
   able to identify directed networks.[55]^14 Methods that use ensemble
   machine learning approaches to train GRNs enable the modeling of
   directed networks. GENIE3 which uses random forest and GRNBOOST2[56]^18
   which uses a gradient boosting approach, top performing methods in
   recent benchmarking studies, with GRNBOOST2 showing superior computing
   efficiency than GENIE3 and better handling of scRNA-seq
   dropouts.[57]^13^,[58]^15 However, both methods generate highly dense
   networks and require immense computational resources. In addition, most
   of the regulatory edges point in both directions, and the resulting
   GRNs contain too many edges in the range of 34,000 to 47,000 edges for
   networks with only 3,000 input genes for GRNBOOST2, making this
   approach impractical for datasets with thousands of cells and full
   transcriptomes.[59]^18 SCENIC,[60]^19 an extension of GRNBOOST2, prunes
   edges based on known transcription factor binding sites (TFBS).
   However, it only focuses on regulatory behavior between transcription
   factors (TF) and its downstream target genes,[61]^19 thereby missing
   other non-TF gene regulatory mechanisms,[62]^20^,[63]^21 as well as
   regulatory patterns because of low TF expression.[64]^22 Although there
   exist other GRN inference approaches based on pseudotime
   analysis,[65]^23^,[66]^24^,[67]^25 their performance is generally
   inferior to those that rely solely on the scRNA-seq data.[68]^13
   Here, we present Single Cell INtegrative gene regulatory network
   inference (SCING), a gradient boosting based approach to efficiently
   identify GRNs in full single cell transcriptomes. The robustness of
   SCING is achieved via (1) merging and taking consensus of GRNs through
   bagging and (2) further directing and pruning edges through the use of
   edge importance and conditional mutual information. SCING GRNs are then
   partitioned into modules, to compute module specific expression for
   each cell. These modules can be used for clustering, phenotypic
   association, and biological annotation through pathway enrichment. We
   show our approach is both efficient and robust on large scRNA-seq
   datasets, able to predict perturbed downstream genes of high throughput
   perturbation experiments, and produces gene subnetworks with
   biologically meaningful pathway annotations. We evaluate our approach
   against GRNBOOST2, ppcor, and PIDC through perturbation target
   prediction in perturb-seq data, goodness of fit, network characteristic
   metrics, and disease modeling accuracy. Furthermore, we apply SCING to
   the mouse single cell atlas,[69]^26 snRNA-seq,[70]^27 and spatial
   transcriptomics[71]^28 datasets to demonstrate its versatility in
   datatype accommodation and its biological interpretability of
   high-throughput transcriptomics datasets. Our code and tutorials for
   running SCING are publicly available at
   [72]https://github.com/XiaYangLabOrg/SCING.
Results
SCING method and evaluation overview
   SCING leverages the power of abundant cell-level transcriptome data
   from scRNA-seq/snRNA-seq to identify potential directional regulatory
   patterns between genes. However, single cell transcriptomics data has
   technical issues, such as high sparsity and low sequencing
   depth,[73]^29^,[74]^30 which make the use of traditional linear or
   correlative approaches challenging. In addition, these datasets often
   contain tens of thousands of cells each with hundreds to thousands of
   genes, making identifying GRNs using complex non-linear approaches on
   full transcriptomes difficult.[75]^13 To address these limitations, we
   employ a combination of supercell, gene neighborhood based connection
   pruning, bagging, gradient boosting regression, and conditional mutual
   information approaches to identify robust regulatory relationships
   between genes based on single cell transcriptomics data ([76]Figure 1A;
   [77]STAR Methods). SCING contains four tunable hyperparameters: the
   number of supercells to reduce sparsity, the number of subsampled
   networks for bagging, the number of nearest neighbors for feature
   selection in gradient boosting regression, and the consensus edge
   overlap threshold when merging subsampled networks. We benchmarked and
   optimized these hyperparameters to balance computational efficiency,
   network properties, and robustness of the GRN based on target gene
   expression prediction ([78]STAR Methods) and provide a description of
   the benchmark framework and justification for our selection in the
   Methods.
Figure 1.
   [79]Figure 1
   [80]Open in a new tab
   SCING overview, benchmarking, and application
   (A) SCING overview. First, we select a specific cell type, or use
   spatial transcriptomics data. We then cluster the cells/spatial spots
   using the Leiden graph partitioning algorithm and merge subclusters
   into supercells. We utilize bagging through subsamples of supercells to
   keep robust edges in the final GRN. For each subsample, the genes are
   clustered based on their PC embeddings to limit likely regulatory
   edges. We then identify edges through gradient boosting regressors
   (GBR). We find the consensus as edges that show up in 20% of the
   subsample networks as the default setting, but this threshold can be
   tuned. We then prune edges and cycles using conditional mutual
   information metrics.
   (B) In silico performance testing using perturb-seq. We identify
   downstream perturbed genes of each guide RNA targeting a specific gene.
   We then predict perturbed genes at each depth in the network from the
   perturbed gene. True positive rate (TPR) and false positive rate (FPR)
   are determined at each depth in the network. We then utilize AUROC and
   TPR at FPR 0.05 as metrics for evaluation.
   (C) Gene prediction validation and network overfitting assessment. We
   split data into training and test sets and build a network on the
   training set. A gradient boosting regressor is trained for each gene
   based on its parents in the training data. We then predict the
   expression of each gene in the test set and determine the distance from
   the true expression through cosine similarity.
   (D) Biological validation through disease subnetwork modeling. We
   utilize a random walk framework from Huang et al. to determine the
   increase in performance of a GRN to model disease subnetworks versus
   random GRNs with similar node attributes.
   (E) Partition of GRNs into modules and functional annotation of
   modules. We apply the Leiden graph partitioning algorithm to identify
   GRN subnetworks and then calculate module specific expression for each
   cell using AUCell and further combine the gene modules with pathway
   knowledge bases to annotate modules with biological pathways.
   (F) Biological applications of SCING to Alzheimer’s disease (AD)
   datasets and the Mouse Cell Atlas datasets. We apply SCING to human
   prefrontal cortex snRNA-seq data with AD and Control patients, whole
   brain Visium spatial transcriptomics data for AD vs. WT mice at
   different ages, and to the Mouse Cell Atlas for 33 tissues and 106 cell
   types.
   We evaluated our approach against PIDC, a partial information
   decomposition approach; ppcor, a partial correlation approach; and
   GRNBOOST2, a gradient boosting approach. We selected these particular
   methods for comparison because of their overall better performance in
   recent benchmarking studies[81]^13^,[82]^15 and diverse approaches. We
   compared these methods on the ability to predict downstream gene
   targets of large scale Perturb-seq studies ([83]Figure 1B), robustness
   of the network on training and test data ([84]Figure 1C), metrics
   including the consistency of edge overlap on GRNs built on independent
   cells, and the ability to model disease subnetworks ([85]Figure 1D).
   Furthermore, we demonstrated the utility of using SCING on the full
   mouse cell atlas and a human prefrontal-cortex snRNA-seq dataset[86]^27
   with AD and control patients to perform snRNA-seq batch harmonization,
   gene module identification with biological annotation ([87]Figure 1E),
   and module-trait association analysis ([88]Figure 1F). Data from
   Morabito et al. has snRNA-seq from 11 AD and 7 control human prefrontal
   cortex samples, with 61,472 nuclei across 7 cell types, which provides
   a high-quality dataset for benchmarking. Furthermore, we applied SCING
   to a visium mouse dataset with AD vs. control samples.[89]^28 We show
   that the SCING subnetworks are versatile in data type accommodation
   (scRNA-seq, snRNA-seq, spatial transcriptomics), can resolve spatial
   biology, and are powerful in retrieving biologically meaningful
   pathways, gene connections, and disease associations.
SCING extends network node inclusion capacity and improves computing speed
   SCING builds many GRNs for each dataset and the speed of such
   computation is paramount to reasonable computation for a whole dataset.
   The use of supercells and gene covariance based potential edge pruning
   enables faster performance of SCING. We show that SCING improves
   computational speed over GRNBOOST2 and PIDC, when increasing the number
   of genes ([90]Figure S1A), and number of cells ([91]Figure S1B).
   GRNBOOST2 scales exponentially on the number of genes, whereas PIDC
   scales exponentially on the number of cells, making whole transcriptome
   and large dataset GRN inference difficult. Supercells in SCING ensure
   the network building run time does not increase as a function of cells
   and potential edge pruning enables linear increase in computation with
   respect to genes. We note that ppcor’s fast general matrix formulation
   improves GRN inference time compared to all other approaches, including
   SCING. Although SCING is slower than ppcor, it performs inference on
   4,000 genes in ∼21 s for all cell types, which is reasonable to compute
   hundreds of GRNs for any given sample.
SCING GRNs better predict downstream genes of perturbed genes in perturb-seq
   We tested whether GRNs from each approach can predict gene expression
   changes in downstream genes from gene knockdown treatments. Here, we
   used Perturb-seq datasets, which enabled us to identify the effects of
   many perturbed genes in parallel. We utilize previously published
   datasets with THP-1, dendritic (DC), and K562 cells with 25, 24, and 21
   genes perturbed, respectively.[92]^31^,[93]^32 The DC cells were split
   into lipopolysaccharide (LPS) stimulated and non-stimulated cells with
   perturbations targeting transcription factors (TFs) as two datasets,
   and the K562 cells were split into two datasets based on the genes
   initially perturbed (TFs or cell cycle related genes) in the
   Perturb-seq experiments. THP-1 cells contain perturbations targeting
   PD-L1 regulators.
   We identified genes downstream of each perturbation through an elastic
   net regression framework to determine the effect of RNA guides on each
   gene while regressing out cell state[94]^32 ([95]STAR Methods)
   ([96]Table S1). We compared GRNs generated from SCING to GRNs generated
   from GRNBOOST2, PIDC, and ppcor in predicting genes downstream of each
   target gene in each perturb-seq experiment. For any given network, we
   iterated through the downstream genes of an initially perturbed gene by
   the RNA guide and determined if the predicted downstream genes were
   significantly altered. We determined the true positive rate (TPR) and
   false positive rate (FPR) at each network depth to compute the area
   under the receiver operating characteristic (AUROC) curve, as well as
   the TPR at an FPR of 0.05. We examine TPR at FPR 0.05 to show
   perturbation prediction accuracy in a setting more relevant to
   biological analysis (controlling for FPR 0.05). We first examine the
   prediction performance of each GRN approach when building GRNs on
   datasets with cells removed that have zero expression of the target
   gene. Removing these cells mitigates performance effects from sparsity
   ([97]Figure S2A). Since ppcor and PIDC produce undirected graphs, and
   GRNBOOST2 generally has bidirectional edges, we first evaluated SCING
   against the other methods without considering edge direction, showing a
   higher AUROC for SCING ([98]Figure 2A). Edge direction does not affect
   this metric ([99]Figure 2B). In addition, we show SCING improves TPR at
   FPR of 0.05 ([100]Figure 2C). Edge direction only affects prediction
   accuracy in TPR at FPR 0.05 for dc 3h cells ([101]Figure 2D).
Figure 2.
   [102]Figure 2
   [103]Open in a new tab
   Performance evaluation
   (A) Predicted downstream affected genes of perturb-seq based
   perturbation in 5 datasets with GRNs built on cells with non-zero
   expression of the perturbation of interest. Area under receiver
   operator characteristic (AUROC) curve for prediction of downstream
   perturbations using undirected GRNs.
   (B) AUROC for prediction of downstream perturbation on directed GRNs
   for SCING.
   (C) True positive rate (TPR) at a false positive rate (FPR) of 0.05 for
   the prediction of downstream perturbations on undirected GRNs.
   (D) TPR at FPR of 0.05 for the prediction of downstream perturbations
   on directed GRNs for SCING.
   (E) Measure of network overfitting by ratio of cosine similarity
   between predicted gene expression and actual for testing and training
   data in held out data for astrocytes, microglia, and oligodendrocytes.
   (F) Cosine similarity of predicted gene expression and actual
   expression in testing data show few differences between SCING and
   others.
   (G) Cosine similarity of predicted gene expression and actual
   expression in training data shows other methods overfit to the training
   data.
   (H) Gene numbers captured in the cell type networks from each method.
   p-values between SCING and each of the other methods was computed with
   an unpaired t-test. (∗: p < 0.05, ∗∗: p < 0.01, ∗∗∗: p < 0.001, ∗∗∗∗:
   p < 0.0001 unpaired t-test).
   Across all Perturb-seq datasets tested and when no cells were removed,
   SCING either outperformed, or met the performance of the other tools in
   both AUROC ([104]Figure S2B) and TPR at FPR 0.05 ([105]Figure S2D) and
   performance was minimally affected by edge direction ([106]Figures S2C
   and S2E). For AUROC, SCING outperforms ppcor across all datasets, PIDC
   across 4 datasets, and GRNBOOST2 across 2 datasets. For TPR at FPR
   0.05, SCING outperforms ppcor and PIDC across 4 datasets and GRNBOOST2
   across all datasets ([107]Figure S2D). SCING performed best in the
   Papalexi et al. THP-1 dataset, although there is a large variance in
   both AUROC ([108]Figure S2B) and TPR at FPR 0.05 ([109]Figure S2D) for
   both ppcor and SCING.
   Overall, SCING outperforms all other approaches at predicting
   perturbation effects in Perturb-seq data when sparsity is adjusted
   ([110]Figure 2) and outperforms select methods when all cells are used
   ([111]Figure S2). Thus, we recommend removing cells with sparse gene
   expression when using SCING.
SCING mitigates overfitting and builds more robust GRNs
   Models often overfit to their data and fail to properly perform on new
   datasets. In the GRN context, we aimed to identify connections and
   networks that are able to capture biological variation rather than
   sample or batch specific effects. To test the performance of GRN
   inference approaches and their ability to capture robust biological
   signals, we tested the ability of a model trained on parents of each
   gene in training data to predict the gene expression of the downstream
   target gene in testing data. We split the scRNA-seq data from control
   human prefrontal cortex[112]^27 into training and testing sets
   ([113]STAR Methods). First, we built GRNs on the training data from
   oligodendrocytes, astrocytes, and microglia using SCING, ppcor, PIDC,
   and GRNBOOST2. Subsequently, we trained gradient boosting regressors
   for each gene based on the parents in a given network using the
   training data. The trained regressors were then used to predict the
   gene expression of cells in testing data based on the expression of the
   parent genes in those cells. We evaluated the performance of each GRN
   approach by averaging the cosine similarity score over all downstream
   genes that have parents in the network. This process was repeated for
   10 replicates on random subsamples of 1,000, 3,000, and 5,000 genes
   based on runtime feasibility ([114]Figure S3).
   To measure overfitting, we used the cosine similarity score ratio
   between the test and training sets, with a higher ratio indicating
   lower overfitting. We found that SCING GRNs had less overfitting than
   the other approaches ([115]Figures 2E and [116]S3A–S3C). In terms of
   performance in the test sets, SCING performed similarly to ppcor and
   PIDC and outperformed GRNBOOST2 ([117]Figure 2F). On training data,
   GRNs from ppcor, PIDC, and GRNBOOST2 had higher cosine similarity
   scores compared to SCING, reflecting overfitting on the training data
   by the other methods ([118]Figure 2G). We noted that the number of
   genes in the resulting network to be very low in the ppcor
   oligodendrocyte network ([119]Figure 2H), which likely affected the
   results of ppcor as evaluated by the cosine similarity measure here. In
   addition, we note that ppcor could not build networks for microglia
   with 1,000 genes ([120]Figure S3B). These results support the
   robustness of SCING GRNs with less overfitting.
SCING fits scale free model and shows edge consistency
   As another measure of GRN quality and performance, we compared GRNs
   generated by each method by various standard network metrics
   (scale-free network fit, number of edges, number of genes, and
   betweenness centrality), as well as robustness of network edges between
   networks on 50/50 split datasets. We tested this on 10 replicates for
   each of the 3 cell types (oligodendrocyte, astrocyte, and microglia) in
   the scRNA-seq data from control human prefrontal context, with 3,000
   different genes randomly selected for each subsample of cells.
   GRNs are thought to follow a scale-free network structure, in which
   there are few nodes with many connections, and many nodes with few
   connections.[121]^33 We computed scale-free network structure through
   the R-squared coefficient of a linear regression model regressing on
   the log of each node’s degree and the log of the proportion of nodes
   with that given degree. We show that the R-squared value for SCING is
   significantly higher than PIDC and GRNBOOST2 methods and is trending
   higher than ppcor, indicating SCING networks more closely follow a
   scale free network structure ([122]Figure 3A). In a typical scale-free
   network plot of log10 node count versus log10°, we expect a power law
   distribution. However, we found that networks built on scRNA-seq data
   have a parabolic distribution, with only the right half following a
   power law distribution whereas the left portion of the plot is driven
   by genes that were very sparse and likely do not fit typical
   distributions ([123]Figure 3B). Therefore, we excluded the sparse genes
   from the scale free regression calculation based on a sparsity
   threshold of 0.7.
Figure 3.
   [124]Figure 3
   [125]Open in a new tab
   Network features and consistency
   (A) Descriptive features of networks across SCING and other approaches
   for 10 networks on astrocytes, microglia, and oligodendrocytes. Linear
   regression R-squared for log degree vs. log count for goodness of fit
   metric of scale-free network.
   (B) Example scatterplot of log degree vs. log count with the average
   sparsity of genes in each dot. Brighter red indicates higher sparsity.
   This shows highly sparse genes tend to have lower degrees. We excluded
   points based on a sparsity threshold of 0.7 before computing the scale
   free regression coefficient.
   (C) Average number of edges for each method across all cell types.
   (D) The variance of the betweenness centrality across nodes in each
   graph.
   (E) The overlap score (number of overlapping edges/expected number of
   edges) in independent sets of cells. p-values between methods computed
   with an unpaired t-test between SCING and each of the other methods.
   (∗: p < 0.05, ∗∗: p < 0.01, ∗∗∗: p < 0.001, ∗∗∗∗: p < 0.0001). Error
   bars represent standard error.
   Among all four methods tested, PIDC produced the largest networks
   ([126]Figure 3C), followed by ppcor, GRNBOOST2, and SCING. SCING
   networks have much fewer edges than the other approaches while keeping
   similar numbers of genes ([127]Figure 2H) and better performance.
   Although network size is not a proxy of network accuracy, a smaller
   network with robust and consistent edges helps reduce overfitting
   ([128]Figures 2E, 2F, and [129]3C). One exception is that the ppcor
   oligodendrocytes network contains only 214.8 edges on average compared
   to 14,545.8, 46,891.9, and 449,850 in SCING, GRNBOOST2, and PIDC,
   respectively. Many genes without regulatory edges were not included in
   the final ppcor network. The smaller ppcor oligodendrocyte network has
   implications for the betweenness centrality and edge overlap metrics,
   as follows.
   Betweenness centrality is often used as another metric to determine the
   overall connectedness of a graph.[130]^34 For a given node, betweenness
   centrality is the number of shortest paths that pass through that node,
   indicating how much information that node presents to the graph. High
   betweenness centrality indicates that a node conveys a lot of
   information to a given graph. We found that SCING networks generally
   have higher variance of betweenness for the nodes in the networks
   ([131]Figure 3D). This indicates that some nodes are more centralized
   than others when compared to other approaches, again consistent with
   the scale-free network model.
   To determine network consistency, we split each cell type into two
   groups of non-overlapping cells. We built networks for each dataset
   using all methods and calculated the fraction of total edges that
   overlap between the two networks. Although larger networks tend to have
   more edge overlap, this also holds for larger random networks. We
   designed a normalized overlap score: the fraction of edges overlapped
   divided by the expected number of overlapping edges of a random network
   of the same size ([132]STAR Methods). When controlling for network
   size, SCING has significantly more overlap, or higher reproducibility,
   between networks of 50/50 split data than the other approaches
   ([133]Figure 3E).
SCING more accurately model disease subnetworks
   To evaluate the performance of GRNs on disease modeling, we applied an
   approach developed by Huang et al.[134]^35 Briefly, given a known
   disease gene set and a GRN, we evaluate the ability of the GRN to reach
   held out disease genes by starting from select disease genes in the
   network through random walks. We then compared the performance for each
   network to that of a random network in which the nodes follow similar
   degree characteristics to derive a performance gain measurement. Here,
   we selected known gene sets for 3 classes of diseases from
   DisGeNET[135]^36 (Immune, Metabolic, and Neuronal) ([136]Table S2) and
   obtained scRNA-seq data for cell types from 3 tissues relevant to each
   disease class from the mouse cell atlas[137]^26 (bone marrow for immune
   diseases, brain for neuronal diseases, and liver for metabolic
   diseases) ([138]Table S3). First, to reduce the number of genes, we
   filtered the scRNA-seq data by removing genes expressed in fewer than
   5% of cells and added expressed disease genes from all DisGeNET disease
   gene sets. We built GRNs using each method and evaluated the
   performance gain over random networks on the disease gene sets. We
   found that across all tissues and all disease types, SCING outperformed
   all other approaches ([139]Figure 4A).
Figure 4.
   [140]Figure 4
   [141]Open in a new tab
   Application case 1: constructing and using SCING GRNs based on Mouse
   Cell Atlas scRNA-seq datasets to interpret diseases
   (A) Performance of modeling disease subnetworks for DisGeNET gene sets
   related to the immune, metabolic, and neuronal diseases with GRNs built
   on bone marrow, brain, and liver cells, reveals SCING models disease
   subnetworks more accurately than other methods.
   (B) Clustermap depicting GRNs built with SCING from immune cell types
   (light blue), model disease subnetworks from many different disease
   gene sets, whereas vascular cell types (purple) are more specific to
   vascular diseases. Cell types (rows) from the adaptive (blue) and
   innate (orange) immune systems, show variability in the number of
   diseases (columns) they model (>0.1).
   (C and D) Clustermap shows diseases clustered with hierarchical
   clustering and sorted by the number of cell types that can accurately
   model that disease subnetwork. Diseases are colored by disease category
   (immune related: red; cardiothoracic: green; cancer: blue; immune
   related cancer: purple), and cell types are colored by innate (orange),
   and adaptive immune system (dark blue).
   (E) Innate immune system cell types better model disease subnetworks
   from more diseases. p-values between methods computed with an unpaired
   t-test. (∗: p < 0.05, ∗∗: p < 0.01, ∗∗∗: p < 0.001, ∗∗∗∗: p < 0.0001).
Application case 1: Constructing SCING GRNs using mouse cell atlas (MCA)
scRNA-seq datasets to interpret diseases
   After establishing the performance of SCING GRNs using the various
   approaches described above, we established an SCING GRN resource for
   diverse cell types and tested the broader utility of SCING to produce
   biologically meaningful GRNs. To this end, we applied SCING to generate
   GRNs for all cell types with at least 100 cells in all tissues of the
   MCA. We constructed a total of 273 cell-type specific networks,
   across 33 tissues and 106 cell types. To identify which GRN informs on
   which disease, we applied the above random walk approach from Huang
   et al. and summarized the results in ([142]Figures S4A–S4C,
   [143]Table S4). We found clusters of cell type GRNs defined by DisGeNET
   diseases that had similar patterns ([144]Figures S4A and [145]4B). Some
   disease genes can be modeled well using GRNs from numerous cell types
   ([146]Figure S4A) whereas others are more cell type or tissue specific
   ([147]Figure S4B). In addition, some cell type GRNs are able to model a
   broad range of diseases ([148]Figures S4A and S4C). We found that
   immune cell type (light blue squares in [149]Figure 4B) GRNs can model
   a wide range of diseases, whereas non-immune cell type GRNs (light
   purple squares in [150]Figure 4B) are more specific to vasculature
   related diseases.
   We further explored the dynamics of GRNs of immune cell types across
   all diseases in DisGeNET. We clustered the cell types in the
   performance gain matrix with only immune cell types included, and
   sorted the GRNs by the number of diseases they can accurately model
   ([151]Figures 4C, 4D, [152]S5A, and S5B). We noticed that cell types of
   the innate immune system can model a broader range of diseases than
   those of the adaptive immune system[153]^37 ([154]Figure 4E).
   Our SCING cell type GRNs resource and the above patterns of
   relationships between cell type GRNs and diseases support the utility
   of the SCING cell type GRN in disease interpretation. The networks can
   be accessed at [155]https://github.com/XiaYangLabOrg/SCING to
   facilitate further biological mining of complex diseases.
Application case 2: Using SCING GRNs to interpret Alzheimer’s disease (AD)
   We next applied SCING to a single nuclei RNA-seq (snRNA-seq) dataset
   from Morabito et al. that examined human prefrontal cortex samples from
   AD and control patients to evaluate the applicability of SCING GRNs in
   understanding AD pathogenesis.[156]^27 We focused on microglia because
   of their strong implication in AD[157]^38^,[158]^39 and a better
   current understanding of the genes and biological pathways in microglia
   in AD, to demonstrate that SCING GRNs can retrieve known biology.
   The SCING microglia GRN contained 10,159 genes and 63,056 edges. Using
   the Leiden clustering algorithm,[159]^40 we partitioned the SCING
   microglia GRN into 21 network modules. Next, we summarized module-level
   expression for each cell using the AUCell method from SCENIC on the
   partitioned GRN modules.[160]^19 When cells were clustered based on the
   raw gene expression values, as is typical with human samples, cells
   from individual samples clustered together ([161]Figure 5A), making it
   difficult to isolate sample heterogeneity from biological variability.
   However, when using SCING module expression to cluster cells, the
   sample, batch, and RNA quality effects were mitigated ([162]Figures 5A
   and 5B). In contrast, biologically relevant variation, such as sex, AD
   diagnosis, and mitochondrial fraction were better retained
   ([163]Figures 5C and 5D). In the UMAP control cells tend to localize to
   the right side, whereas cells from females tend to localize to the top
   part. These results suggest that SCING GRNs have intrinsic ability to
   correct for non-biological variations.
Figure 5.
   [164]Figure 5
   [165]Open in a new tab
   Application case 2: Using SCING GRNs to interpret Alzheimer’s disease
   (AD)
   (A) UMAP representation of scRNA-seq data shows sample specific
   differences when operating on gene expression space (left panel).
   Dimensionality reduction on SCING module embeddings removes sample
   specific effects (right panel).
   (B) SCING removes RNA quality effects on gene expression clustering.
   (C and D) Clustering on SCING modules, keeps biologically relevant
   features such as AD status (left panel) and sex (right panel), as well
   as mitochondrial fraction.
   (E) Heatmap showing coefficients of linear regression of diagnosis,
   plaque, and tangle status, on module expression, while regressing out
   sex. Pathway annotations for significant (∗: FDR <0.05) modules are
   provided.
   (F) Subnetwork reveals collocalization of canonical Alzheimer’s genes
   and subnetworks such as APOE and APP.
   To quantitatively evaluate how well SCING GRNs can be used for batch
   effect correction and biological preservation, we compared SCING GRNs
   with commonly used batch correction methods, such as FastMNN,[166]^41
   Harmony,[167]^42 and Seurat,[168]^43 chosen based on their better
   performance in previous benchmarking studies.[169]^44 We first
   performed dimension reduction and clustering of cells based on
   corrected data from each batch correction method. For SCING, the values
   used were SCING GRN module AUCell scores. We took each cell, determined
   how many neighbors in the PC space had the same annotation of interest
   (sample, batch, diagnosis, etc.), and then scored each batch correction
   approach by the fraction of cells that had the same annotation. We
   removed batch and sample specific effects using the F1 score ([170]STAR
   Methods).[171]^44 We found that the SCING GRN module based
   dimensionality reduction carried the ability to correct for batch
   effect and retain biological information ([172]Figure S6) in a similar
   manner to dedicated batch correction methods such as FastMNN,[173]^41
   Harmony,[174]^42 and Seurat.[175]^43 Although SCING GRN based batch
   correction was not as optimized as the dedicated batch correction
   methods, many typical batch correction methods do not provide a batch
   corrected gene expression matrix for downstream analysis.[176]^44 SCING
   GRN modules have direct biological interpretability without prior batch
   correction, since each SCING module can be associated with phenotypic
   traits but not batch and annotated with pathways, as described below.
   We further note that SCENIC[177]^19 has shown similar robustness to
   batch effect and that this is likely a general characteristic of
   GRN-based dimensionality reduction approaches that is not specific to
   SCING. This general characteristic offers network methods an important
   advantage over individual gene-based analysis.
   We identified SCING GRN modules associated with AD diagnosis, plaque
   stage, and tangle stage through linear regression of the phenotypic
   traits and each module’s expression across cells, while regressing out
   sex specific differences. We found that ∼43% of modules were
   significantly (FDR <0.05) associated with at least one trait and ∼24%
   of modules were significantly associated with all three traits. To
   examine the biological interpretation of these modules, we performed
   pathway enrichment on the genes in each module utilizing the GO
   biological process, DisGeNET, Reactome, BioCarta, and KEGG knowledge
   bases. We found that 78% of the significantly trait-associated modules
   were significantly enriched for biological pathways ([178]Figure 5E,
   [179]Table S5). These modules recapitulated pathways related to AD
   (module 9), immune processes (modules 0, 2, 9, 13, and 19), cytokine
   triggered gene expression (modules 12, 18, 19), and endocytosis
   (modules 2, 9, and 13). These are expected perturbed pathways for
   microglia in AD.[180]^38
   We dug into the vesicle mediated transport pathway from module 2 and
   visualized the network with the differential gene expression of each
   gene ([181]Figure S7). This microglia pathway is important in
   AD.[182]^45 In addition, we found the APOE and APP subnetwork within
   the vesicle mediated transport pathway which are among the most
   significant genetic risk factors for AD
   ([183]Figure 5F).[184]^46^,[185]^47^,[186]^48
   Therefore, our results demonstrate that SCING GRNs can correct for
   batch effects intrinsic to scRNA-seq studies and can recapitulate known
   cell type specific genes, pathways, and network connections.
Application case 3: Using SCING to model GRNs based on 10x Genomics Visium
spatial transcriptomics data to interpret AD
   To evaluate the applicability of SCING beyond scRNA-seq and snRNA-seq,
   we next applied SCING to spatial transcriptomics data from Chen
   et al.[187]^28 as a new approach for spatial transcriptomics analysis.
   We built an SCING GRN on all spots in the visium data to obtain a
   global GRN with 128,720 edges across 15,432 genes. In addition, Chen
   et al. profiled beta amyloid plaque with immunohistochemistry, and we
   included this protein expression value in the SCING GRN, highlighting
   the possibility of constructing multi-omics networks using SCING.
   We partitioned the genes in the resulting SCING GRN with the Leiden
   graph partition algorithm into 33 modules and performed the AUCell
   score from SCENIC[188]^19 to obtain module specific expression for each
   spot and annotated the enriched pathways for each module
   ([189]Table S6). We found module specific expression in the mouse brain
   subregions ([190]Figures 6A, 6B, and [191]S8), based on clustering of
   the average module expression across spots in each brain subregion
   ([192]Figure S9). We found cortical subregions to cluster together, as
   well as the thalamus and hypothalamus ([193]Figure S9), based on GRN
   module expression patterns. We also identified modules more
   specifically expressed in the cortex and hippocampus (CS, HP) (module
   12), or the fiber tract, thalamus, and hypothalamus (BS) (module 14)
   ([194]Figures 6A and 6B). Module 12 was highly enriched for genes
   involved in neuronal system, axonogenesis, and chemical synapse, which
   might reflect the dynamic status of hippocampus and cortex neurons for
   memory formation and cognitive function. In contrast, module 14 was
   enriched in genes involved in myelination[195]^49 and blood-brain
   barrier,[196]^50 consistent with the high enrichment of oligodendrocyte
   populations in the fiber tracts, and indicated blood-brain barrier
   changes in the thalamus. We also found modules that separate more
   similar subregions from one another such as module 27 (enriched for
   axonogenesis) and 21 (enriched for calcium ion transport) expressed
   much higher in the thalamus than in the hypothalamus ([197]Figure S9).
   By contrast, modules, such as module 5 (enriched for chromatin
   organization and peroxisomal lipid metabolism) and module 19 (enriched
   for ribosomal biogenesis and protein processing), are much less
   specific to subregions ([198]Figures S8 and [199]S9). These are general
   cellular functions that are expected to have broad expression across
   the brain.
Figure 6.
   [200]Figure 6
   [201]Open in a new tab
   Application case 3: Using SCING to model GRNs based on 10x Genomics
   Visium spatial transcriptomics data to interpret AD
   (A) Boxplots show regional specificity of specific modules, namely
   module 12 and 14 (red boxes).
   (B) Regional specificity of module 12 (hippocampus and cortex, enriched
   for neuronal system pathways), and module 14 (hypothalamus, thalamus,
   and fiber tract, enriched for microglial activation and cell migration
   pathways), visualized on brain samples of 3-month-old WT mice. Enriched
   pathways were labeled at the bottom of each module.
   (C) Module association in AD versus WT mice. Red boxes indicate modules
   9 and 25, which are further visualized.
   (D) Visualization of AD-associated module 9 (enriched for
   neurodegeneration) in 18-month-old WT and AD mice.
   (E) Visualization of AD-associated module 25 (enriched for microglia
   activation and cell migration) in 18-month-old WT and AD mice.
   (F) Module 25 subnetwork of Trem2 and complement proteins shows
   microglial association of module 25. Full subnetwork in
   [202]Figure S11. Nodes are colored by marker gene status of neuronal
   (blue), microglial (red), or both (purple), as determined from the
   Allen Brain Atlas. Cross cell type communication edges seen between red
   and blue.
   (G and H) Pearson correlation between module expression of each module
   and plaque (G) or age (H). Significance values for each module are
   determined by a null distribution generated from 1,000 random
   subsamples of genes with the same gene number of a given module. The
   AUCell was computed for each random module and Pearson correlation
   coefficients were calculated. The p value for the true module was
   computed based on the null distribution of the correlation
   coefficients, and p values were further corrected for multiple testing
   using Bonferroni correction. Significance is shown (∗: adjusted p value
   <0.05). Subpanel a,c: (∗: p < 0.05, ∗∗: p < 0.01, ∗∗∗: p < 0.001, ∗∗∗∗:
   p < 0.0001 by unpaired t-test).
   We leveraged the Allen Brain Atlas to determine the marker gene
   proportion of each SCING module based on the genes in the module. We
   determined marker genes for each of neurons, microglia/perivascular
   macrophages, astrocytes, and oligodendrocytes from the Allen Brain
   Atlas scRNA-seq data. We then determined the proportion and enrichment
   of genes in each module that are marker genes for each cell type and
   plotted the data as heat maps ([203]Figure S10). We see that many
   modules are mostly made up of neuronal genes, however some modules have
   higher contributions from oligodendrocytes (modules 14, 26), and
   microglia (modules 25, 16, 19, 20). This indicates that utilizing high
   quality reference data can help to deconvolve cell type membership of
   SCING modules.
   We found many modules to be AD associated ([204]Figure 6C), in
   particular, module 9 (enriched for neurodegeneration) ([205]Figure 6D)
   and module 25 (enriched for microglial activation, lysosome, and cell
   migration) ([206]Figure 6E). We further explored these subnetworks and
   found most of the genes in module 9 to be neuronal marker genes,
   whereas most of the genes in module 25 to be microglial marker genes
   ([207]Figure S11). The module 25 subnetwork to contain the Trem2 and
   C1q subnetworks, highly profiled in AD microglial cells
   ([208]Figure 6F). Of interest, we found a cross-module edge and several
   cross-cell-type edges, likely revealing intercellular communications.
   We identified 6 modules that were significantly associated with amyloid
   beta plaque ([209]Figure 6G) through Pearson correlation and null
   distribution from a permutation approach ([210]STAR Methods). Modules
   25 (enriched for immune function) and 19 (enriched for nonsense
   mediated decay, and infectious disease) were among the most highly
   correlated with plaque and were also associated with AD
   ([211]Figure 6C).
   We found that amyloid plaque staining could be partitioned based on the
   expression of module 10 in the SCING GRN ([212]Figure S12), which had
   significantly higher expression in AD mice than in WT ([213]Figure 6C),
   and is also quantitatively correlated with plaque ([214]Figure 6G).
   Module 10 contains genes highly related to metabolism,
   neurodegenerative diseases, and immune function ([215]Table S6). We
   show the subnetwork for the amyloid beta plaque in module 10
   ([216]Figure S12). In addition, module 25 (enriched for microglial
   activation, lysosome, and cell migration) was also highly associated
   with plaque ([217]Figure 6G), as expected in AD pathogenesis.
   In addition to plaque association, we explored the age-related modules
   through Pearson correlation between age (6, 12, and 18 months) and
   module expression in WT samples. We found 10 modules to be age related
   ([218]Figure 6H). Two highly positively associated modules with age,
   module 10 (subnetwork containing amyloid protein) and 13, have
   age-related pathways. Module 10 is involved in protein degradation,
   neurodegeneration, and cell cycle, and module 13 is involved in
   metabolism, cellular response to stress, and immune system
   ([219]Figure 6H, [220]Table S6).
   We found certain modules such as module 30 (enriched for neuropeptide
   signaling) to be spatially variable ([221]Figure S13) and associated
   with AD ([222]Figure 6B). Based on the locations from the Allen Brain
   Atlas[223]^51 ([224]Figure S13A), we find module 30 to be more specific
   to the hypothalamus than other regions of the BS ([225]Figure S13B). We
   also find that within the hypothalamus, module 30 expression is higher
   in 18-month-old AD mice compared to the WT mice ([226]Figure S13C).
   Hypothalamic alterations have been observed in the hypothalamus in AD
   development.[227]^52 The module 30 subnetwork ([228]Figure S13D) shows
   key hypothalamic neuropeptides, such as Pomc[229]^53 and Pnoc[230]^54
   which are consistent with their enrichment in the hypothalamus.
   Our applications of SCING to spatial transcriptomics data demonstrate
   its broader utility beyond scRNA-seq/snRNA-seq and revealed spatial
   network patterns of AD.
Discussion
   Single cell multiomics has become a powerful tool for identifying
   regulatory interactions between genes, but the performance of existing
   tools is limited in both accuracy and scalability.[231]^13 Here, we
   present SCING, a gradient boosting, bagging, and conditional mutual
   information based approach for efficiently extracting robust GRNs on
   full transcriptomes for individual cell types. We validate our networks
   using a novel perturb-seq based approach, held-out data prediction, and
   established network characteristic metrics (network size, network
   overlap, scale-free network, and betweenness centrality) to determine
   performance and network features against other existing tools. SCING
   not only offers robust and accurate GRN inference and improved gene
   coverage and speed compared to previous approaches,[232]^9^,[233]^18
   but also versatile GRN inference with scRNA-seq, snRNA-seq, and spatial
   transcriptomics data. Using various application examples, we show that
   SCING infers robust GRNs that inform on cell type specific genes and
   pathways underlying pathophysiology while simultaneously removing
   non-biological signals from data quality, sample, and batch effects
   through gene regulatory module detection and functional annotation. We
   also provide a comprehensive SCING GRN resource for 106 cell types
   across 33 tissues using data from MCA to facilitate future applications
   of single cell GRNs in our understanding of pathophysiology.
   SCING efficiently identifies robust networks using supercells, a
   bagging approach, and mutual information-based edge pruning, to remove
   redundant edges in the network. The supercell and reduction in
   potential edges make the bagging approach possible by removing
   computational time for each GRN. GRNBOOST2 uses a similar framework as
   SCING (gradient boosting regression) but overfits the data by
   generating too many edges to be interpretable, which are also
   undirected, making them less useful for biological interpretation.
   Meanwhile, ppcor and PIDC use partial correlation and partial
   information decomposition approaches, which are more accurately
   described as measures of coexpression, rather than gene regulation. In
   contrast, the directed graphs built with SCING show better perturbation
   prediction and consistency across replicates.
   Validation of GRN inference tools has remained challenging.[234]^13
   Since ground truth networks are unknown for many conditions across cell
   types and tissues, current GRN accuracy evaluations rely on inferring
   synthetic ground truth networks that may not capture biological or
   curated interactions that forsake tissue and cell-type
   specificity.[235]^13 Our novel Perturb-seq based approach provides a
   unique way to determine GRN accuracy, since it leverages cells with
   specific gene perturbations to infer downstream genes. Prediction of
   perturbed genes is a very powerful aspect of GRN construction, and
   SCING stands out above all other methods in this regard.[236]^55 This
   is the key difference between our benchmarking framework and that used
   in BEELINE by Pratapa et al., whereas the other aspects of the
   benchmarking framework are conceptually similar.
   We demonstrate that SCING networks are applicable to scRNA-seq,
   snRNA-seq, and spatial transcriptomics data. Our network-based module
   expression is robust to batch effects and provides biologically
   annotated expression values for each cell that can be directly used for
   disease modeling ([237]Figure 4), phenotypic correlation
   ([238]Figure 5), and further spatial analysis ([239]Figure 6) to boost
   our ability to interpret single cell data. We note that SCING for
   spatial transcriptomics network analysis does not currently utilize the
   spatial information during network construction, although the resulting
   network modules intrinsically carry a certain level of spatial
   information ([240]Figure 6). Explicitly leveraging the spatial
   information will likely enhance network performance and is an important
   future direction to improve SCING for network modeling of spatial
   transcriptomics data.
   At the intersection of single cell omics and complex diseases, SCING
   provides sparse but robust, directional, and interpretable GRN models
   for understanding biological systems and how they change through
   pathogenesis. GRNs can be analyzed to identify and predict perturbed
   subnetworks, and as a result, be used to investigate key drivers of
   disease.[241]^56 Identifying key drivers of disease by teasing apart
   biology and technical variation from high throughput, high dimensional
   datasets will lead to more successful drug and perturbation target
   identification, as well as robust drug
   development.[242]^57^,[243]^58^,[244]^59
   We note that SCING is currently tested to infer GRNs based on
   individual scRNA-seq, snRNA-seq, and spatial transcriptomics data.
   Other types of omic information such as scATAC-seq, scHi-C, or cell
   type specific trans-eQTL information can be included in SCING to
   further inform on regulatory structure to refine and improve on
   GRNs.[245]^9^,[246]^60^,[247]^61^,[248]^62^,[249]^63 Information from
   multiple data types will become an integral part of the systems
   biology, and future efforts to properly model multiomics data
   simultaneously to inform on complex disease are warranted.
Limitations of the study
   Although we show high utility of SCING through a number of evaluation
   methods, the following limitations should be considered when applying
   the method. First, SCING infers GRNs based on observational scRNA-seq
   and spatial transcriptomics data and the links between genes are not
   necessarily causal or directional. Second, there are many
   hyperparameters that require selection, and while we show robustness
   across a range of hyperparameters, selection can be dataset dependent
   and optimization of hyperparameters is encouraged. Thirdly, currently
   the spatial information of spatial transcriptomics is not explicitly
   utilized in SCING GRN construction despite the fact that the resulting
   GRNs show spatial patterns. We finally note that additional omics
   information (i.e., scATAC-seq, scHi-C, ChIP-seq, or cis/trans eQTL,
   etc.) can provide additional regulatory information that would improve
   GRN accuracy and will be considered in future development of SCING.
STAR★Methods
Key resources table
   REAGENT or RESOURCE SOURCE IDENTIFIER
   Software and algorithms
     __________________________________________________________________
   Python 3.8.13 Van Rossum et al.[250]^64 [251]https://www.python.org/
   Numpy 1.21.6 Harris et al.[252]^65 [253]https://numpy.org/
   Pandas 1.4.2 McKinney et al.[254]^66 [255]https://pandas.pydata.org/
   Matplotlib 3.5.2 Hunter et al.[256]^67 [257]https://matplotlib.org/
   Scanpy 1.9.1 Wolf et al.[258]^68 [259]https://github.com/scverse/scanpy
   Scikit-learn 1.1.1 Pedregosa et al.[260]^69
   [261]https://scikit-learn.org/stable/index.html
   Scipy 1.8.1 Virtanen et al.[262]^70 [263]https://scipy.org/
   Statsmodels 0.13.2 Seabold et al.[264]^71
   [265]https://www.statsmodels.org/stable/index.html
   Dask/Distributed 2022.7.0 Rocklin et al.[266]^72 [267]https://dask.org
   Pyitlib 0.2.2 Peter Foster and Michael Milton[268]^73
   [269]https://github.com/pafoster/pyitlib
   Leidenalg 0.8.10 Traag et al.[270]^40
   [271]https://leidenalg.readthedocs.io/en/stable/
   Python_igraph 0.9.11 Csardi et al.[272]^74 [273]https://igraph.org/
   Networkx 2.8.3 Hagberg et al.[274]^75 [275]https://networkx.org/
   Ctxcore 0.1.1 Bravo González-Blas et al.[276]^76
   [277]https://ctxcore.readthedocs.io/en/latest/
   Pyscenic 0.11.2 Aibar et al.[278]^19 [279]https://scenic.aertslab.org/
   R 4.1.2 R Core Team (2022).[280]^77 [281]https://www.r-project.org/
   enrichR 3.0 Kuleshov et al.[282]^78
   [283]https://www.rdocumentation.org/packages/enrichR/versions/3.0
   Ppcor Kim, 2015[284]^16
   [285]https://www.rdocumentation.org/packages/ppcor/versions/1.1
   PIDC Chan et al., 2017[286]^17
   [287]https://github.com/Tchanders/NetworkInference.jl
   GRNBOOST2 Moerman et al., 2019[288]^18
   [289]https://arboreto.readthedocs.io/en/latest/installation.html
   Random Walk Huang et al., 2018[290]^35
   [291]https://github.com/idekerlab/Network_Evaluation_Tools
   SCING This paper/GitHub [292]https://github.com/XiaYangLabOrg/SCING
     __________________________________________________________________
   Deposited data
     __________________________________________________________________
   Human Alzheimer’s Disease snRNAseq Morabito et al., 2021[293]^27
   [294]https://www.synapse.org/#!Synapse:syn22079621/
   Dendritic cell and k562 cell line perturb-seq datasets Dixit et al.,
   2016[295]^32 GEO: [296]GSE90063
   THP-1 cells Papalexi et al., 2021[297]^31 GEO: [298]GSE153056
   Mouse Cell Atlas Han et al., 2018[299]^26
   [300]https://bis.zju.edu.cn/MCA/
   DisGeNET Piñero et al., 2019[301]^79 [302]http://www.disgenet.org/
   Mouse AD Visium Chen et al., 2020[303]^28 GEO: [304]GSE152506
   [305]Open in a new tab
Resource availability
Lead contact
   Further information and requests for resources should be directed to
   and will be fulfilled by the lead contact, Xia Yang
   ([306]xyang123@g.ucla.edu).
Materials availability
   This study did not generate new reagents.
Method details
SCING method overview
   To reduce the challenges from data sparsity from single cell omics, as
   well as reduce computational time, we first used supercells which
   combine gene expression data from subsets of cells sharing similar
   transcriptome patterns. To improve the robustness of GRNs, we built
   GRNs on subsamples before merging the networks, keeping edges that
   appear in at least 20% of networks. Lastly, we removed cycles and
   bidirectional edges where one direction was >25% stronger than the
   other direction and pruned the network using conditional mutual
   information to reduce redundancy. These steps are described in more
   detail below. To benchmark the performance of SCING, we selected three
   existing methods, namely GRNBOOST2, ppcor, and PIDCm with default
   parameters. These methods were selected based on previous benchmarking
   studies where superior performance of these methods were
   supported.[307]^13
Preprocessing
   For supercell preprocessing, we first normalized the data for a total
   count number of 10,000 per cell using the pp.normalize_total function
   from Scanpy.[308]^68 We then took the log1p of the gene expression
   values, using Scanpy’s pp.log1p function. We then identified and subset
   to the top 2000 highly variable genes using Scanpy’s
   pp.highly_variable_genes function.[309]^68 Data were then centered and
   scaled using Scanpy’s pp.scale function and further used to compute the
   nearest neighbor embedding with the default 10 neighbors using Scanpy’s
   pp.neighbors function on the top 20 principal components of the gene
   expression, calculated using Scanpy’s tl.pca function.[310]^68
   For network building preprocessing, we normalized the total number of
   counts in each cell to 10,000 using Scanpy’s pp.normalize_total
   function and took the natural log of the gene expression using Scanpy’s
   pp.log1p function.[311]^68 We removed genes not expressed in any cells
   and any duplicate genes. We transposed, centered, and scaled the data
   using Scanpy’s pp.scale function before running principal component
   analysis (PCA) using Scanpy’s tl.pca function.[312]^68 This provides us
   with low dimensional embeddings for each gene. A nearest neighbor
   algorithm from scikit-learn[313]^80 was used to find the nearest
   neighbors of each gene. The potential regulatory relationship between
   genes was limited to the 100 nearest neighbors for each downstream
   gene.
Supercell construction
   We define a supercell as a pseudobulk generated from a cluster of
   cells, determined through Leiden clustering on the nearest neighbors
   connectivities graph of cells in the principal component space of the
   gene expression data. For each dataset, we perform preprocessing as
   described above. We then used the Leiden graph partitioning algorithm
   to separate cells into groups using Scanpy’s tl.leiden
   function.[314]^68 The Leiden resolution is determined by the user input
   specifying the desired final number of supercells using binary search
   on the resolution parameter, with higher resolution leading to more
   supercells and lower resolution leading to fewer supercells. Here, we
   used 500 supercells, which balances runtime, with dataset summarization
   ([315]Figure S14). Benchmarks for run time, supercell mean and variance
   in sparsity, and network robustness suggested 500 supercells to be
   optimal for balancing the tradeoff between sparsity mitigation and
   network performance ([316]Figure S14). We then merged each group of
   cells into a supercell by averaging the gene expression within each
   group.
GRN inference in SCING
   For each downstream gene, we trained a gradient boosting
   regressor[317]^18 to predict the expression using nearest neighbors as
   potential upstream regulators. For each upstream gene, we form a
   directed edge to the downstream gene with a corresponding edge weight
   based on the feature importance of that upstream gene in predicting the
   downstream gene in the gradient boosting regressor. We keep the top
   10 percent of edges based on edge weight to remove edges with very low
   weights. For each gradient boosting regressor, we used 500 estimators,
   max depth of 3, learning rate of 0.01, 90 percent subsample, the square
   root of the total number of features as max features, and an early stop
   window of 25 trees, if the regressor was no longer showing improved
   performance.[318]^18
SCING parameter selection
   Parameter selection was used to retain biological accuracy while
   limiting computational cost. We determined that using 500 supercells
   and 100 GRNs per merged network balanced computational resourcefulness
   with robustness. If computational cost is not an issue, supercells are
   not necessary and more GRNs can be built as intermediates. Since GRNs
   are typically built within a single cell type, we use 10 principal
   components (PCs) for determining gene covariance. Typically in scRNAseq
   analysis more PCs are used, but there is less variation overall within
   one cell type. We determined 100 neighbors to be chosen for each gene,
   again balancing computational cost.
SCING parameter benchmark framework and selection
   SCING contains four tunable hyperparameters: the number of supercells,
   the number of subsampled networks for bootstrapping, the number of
   nearest neighbors for feature selection in gradient boosting
   regression, and the consensus edge overlap threshold when merging
   subsampled networks. We designed a pipeline to compare computational
   efficiency, network properties, and GRN robustness in gene expression
   prediction accuracy across different parameter settings
   ([319]Figure 1C). Optimal parameters were defined based on the balance
   of these metrics, ideally resulting in fast run time, more genes in the
   network, and high prediction accuracy. We tested one parameter at a
   time while fixing others at their default values. We tested the number
   of supercells at 100, 300, 500, 700, and 900 ([320]Figure S14); number
   of nearest neighbors at 30, 50, 100, 200, and 400 ([321]Figure S15);
   number of subsampled networks at 30, 50, 100, 200, 300, and 400
   ([322]Figure S16); and edge consensus threshold proportions at 0.05,
   0.1, 0.2, 0.5, and 0.8 ([323]Figure S17). Networks were generated under
   these settings on all genes in the oligodendrocytes from Morabito
   et al.27, since this cell type contained the most cells in the dataset,
   for building networks that are representative of SCING performance. We
   performed network robustness on prediction accuracy with the same
   protocol used to compare the cosine similarities of predicted and
   actual gene expression between SCING and other GRN methods, which is
   detailed in “Network robustness evaluation based on training and held
   out testing data” in the methods section. The network characteristics
   and run times were calculated under the same framework as the
   cross-method comparisons, which is detailed in “Computation of time
   requirements” in the methods section.
   As the supercells were used to mitigate gene sparsity, we used low mean
   supercell gene sparsity and low variance in supercell gene sparsity as
   an initial guide for our selection of 500 supercells. The benchmark
   results coincided with this general heuristic, since 500 supercells
   exhibited a balance of high relative cosine similarity for target gene
   expression prediction and reasonable GRN construction run time
   ([324]Figure S14). Given our robust supercell selection, the number of
   subsampled networks had little effect on SCING performance, with any
   subsample size above 30 networks yielding consistent accuracy. Run
   times generally increased with subsampled networks, but the networks
   were largest at 100 subsampled networks. While any size above 30 is an
   acceptable choice, we decided on 100 subsampled networks to balance run
   time and network size ([325]Figure S16). For the nearest neighbor
   selection, we observed comparable cosine similarity across all
   parameter values, faster run time for GRNs with lower numbers of
   neighbors, and denser networks with higher numbers of neighbors
   ([326]Figure S15). To balance efficiency and network size, we decided
   the default setting to be 100 neighbors. Lastly, our consensus
   parameter benchmark revealed that GRNs with higher edge consensus
   thresholds yield smaller and sparser networks with lower gene counts,
   quicker run times, and slightly improved gene prediction accuracy and
   overfitting. While stringent thresholds remove edges that would reduce
   overfitting and increase prediction accuracy for downstream genes in
   the networks, they run the risk of losing regulatory information, since
   these smaller networks omit many genes and cannot predict their
   expression. In order to comprehensively explain a cell type’s
   regulatory landscape, we decided to include more genes in our networks
   without greatly sacrificing performance ([327]Figure S17). For these
   reasons, we set the consensus threshold to 20% for our study.
Merging GRNs from data subsamples
   We subsampled supercells from the supercell data without replacement
   and built one network for each subsample. For each subsampled network,
   we kept the top 10 percent of edges based on the edge weights, which
   are the feature importance measures of upstream genes in predicting
   downstream genes in the trained gradient boosting regressor, to reduce
   the number of edges with very low edge weights. We also kept the top 3
   edges for each downstream gene to reduce any gene specific bias caused
   by feature importance.[328]^81 Of these edges, we kept the edges that
   appeared in at least 20% of all networks from the subsamples into a
   merged network. The threshold of 20% was based on the testing results
   of multiple thresholds ([329]Figure S17), which showed this threshold
   balances accuracy, network size, and run time. For the edges that were
   kept in the merged network, the edge weights were the sum of the
   weights for that edge across the subsampled networks. We also removed
   reversed edges if the edge with a higher weight was at least 25%
   stronger than that of the weaker reverse direction. Otherwise, we kept
   the edge bidirectional. We removed cycles with more than two edges in
   the graph by removing the edge with the lowest edge weight.
   Additionally, we removed triads in the network based on the
   significance of the conditional mutual information to remove redundant
   edges between genes that may result primarily from sharing an upstream
   regulator in the network. The p value of the conditional mutual
   information was based on the chi-squared distribution.[330]^82 If an
   edge between two genes was not statistically significant given a parent
   of both of the genes, then the edge was removed.
Quantification and statistical analysis
Other GRN methods
   We benchmarked SCING against GRNBOOST2, ppcor, and PIDC. Default
   parameters were used for all existing approaches unless otherwise
   specified.
   For GRNBOOST2,[331]^18 we ran this approach by predicting the
   expression of all genes from all other genes. We then took the top 10%
   of edges to reduce the number of edges with extremely low importance
   (e.g. 10^−17).
   For ppcor,[332]^16 due to the sparse non-linear nature of scRNAseq
   connections, we ran the approach using spearman correlation. We only
   kept edges with a Benjamini-Hochberg FDR <0.05.
   For PIDC,[333]^17 according to their tutorial, we used a threshold of
   0.1, to keep the top 10% of highest scoring edges.
Datasets
   For the Perturb-seq validation, we used datasets from Dixit
   et al.[334]^32 and Papalexi et al.[335]^31 Dixit et al. has 24
   transcription factors perturbed in dendritic cells and 25 cell cycle
   genes targeted in K562 cells, while Papalexi et al. has 25 PD-L1
   regulators perturbed in THP-1 cells.
   For the train-test split and network consistency assessment, we used
   the human AD snRNAseq data from Morabito et al.[336]^27 This adds
   another slightly different data type from the scRNAseq in the
   perturb-seq and MCA and is later used for biological application with
   microglial cells in AD.
   We used the mouse cell atlas[337]^26 scRNAseq database, since it has a
   large number of cell types (106 cell types) across numerous tissues (33
   tissues), to test 446 disease associations with the random walk
   approach from Huang et al.[338]^35 This additionally provides a
   resource of GRNs throughout cell types of the entire mouse.
   Finally, to test the applicability of SCING on spatial transcriptomics
   data, we used the mouse AD dataset from Chen et al.[339]^28 This
   dataset contains AD and WT mice from various age groups (3, 6, 12,
   18 months), in addition to amyloid beta plaque staining.
Computation of time requirements
   We determined the run time to build a GRN from each approach on subsets
   of cells and genes with varying cell numbers and gene numbers. All
   tests were performed on a ryzen 9 3900X 12-Core processor with
   64Gb RAM. We determined the speed on 10 iterations of randomly selected
   genes (1000, 2000, 4000) with 1000 cells each, and on randomly selected
   cells (250, 500, 1000) with 1000 genes each.
Overview of network robustness evaluation based on perturb-seq datasets
   Briefly, to determine the accuracy of the GRNs from each approach with
   Perturb-seq data, we first identified significantly altered genes
   downstream of each guide RNA perturbation through an elastic net
   regression approach,[340]^32 as detailed below. We then determine the
   accuracy of a given network by identifying the true positive rate (TPR)
   and false positive rate (FPR) at each depth in the network. The AUROC
   and TPR at FPR 0.05 were determined for each network on a given
   perturbation. More details on each step are below.
Computation of guide RNA perturbation coefficients
   First, we downloaded the Perturb-seq data from Papalexi et al.[341]^31
   and Dixit et al.[342]^32 We then followed the steps as described by
   Dixit et al.[343]^32 to compute guide RNA perturbation coefficients,
   which indicate the effects of a specific guide RNA (single
   perturbation) on other genes.
   As cell state can affect gene perturbation efficiency, to determine
   cell states and remove state specific perturbations, we first separated
   out unperturbed cells which were not transfected with positive guide
   RNAs. We then clustered these unperturbed cells with Leiden clustering
   to define subclusters that represent cell states. The Leiden resolution
   was determined by identifying unique subclusters in the data (DC 0h:
   1.2, DC 3h, 0.7, K562 cell cycle: 1.0, K562 TF: 1.0, Papalexi: 1.2). We
   then subset the data to highly variable genes (min_mean=0.0125,
   max_mean=3, min_disp=0.5), centered and scaled the highly variable
   genes data, and performed PCA to get the first 50 PCs. We then trained
   a linear support vector machine (SVM, C=1) on the top 50 PCs of the
   data to predict the cell subcluster (state) membership of the
   non-perturbed cells. We then applied the trained SVM to all cells that
   include both perturbed and non-perturbed cells to get a continuous
   probability of cell state for each cell. We used these continuous state
   probabilities in our regression equation to regress out state specific
   effects on gene expression. To identify the perturbation effect of a
   given guide RNA on genes other than the target gene, we utilized
   elastic net regression (l1_ratio=0.5, alpha=0.0005).[344]^32 We
   generated a binary matrix (cells x RNA guides) which depicts which
   guide RNAs are in each cell based on the perturb-seq sequencing data.
   We then fit an elastic net model to predict the gene expression of all
   genes from the binary matrix in each cell, combined with the continuous
   state values determined for each cell. To remove the effect of
   synergistic perturbations, we removed cells with multiple
   perturbations. We determined each guide’s perturbation effect on a
   given gene by the regression coefficient.
Determination of significant perturbation effects
   To determine the significance of a given perturbation coefficient, we
   employed a permutation test as in Dixit et al.[345]^32 For each guide,
   we permuted the vector of perturbations to randomize which cells
   received the given perturbation of interest. The elastic net regression
   model was trained with the same hyperparameters to determine the
   coefficients of perturbation. This approach was repeated 100 times to
   generate a null distribution of the perturbation effect of a given
   guide on each gene. The p-value was calculated as the fraction of null
   coefficients that were greater than or less than the true coefficient,
   determined by the sign of the coefficient. Significant perturbations
   were determined at a false discovery rate of 0.05 using the
   Benjamini-Hochberg procedure. This permutation approach was repeated
   for each guide RNA. A gene was determined as a downstream perturbation
   if at least one guide had a significant perturbation for the given
   gene.
Selection of genes and cells for perturb-seq networks
   To reduce computational cost and enable network building for all
   approaches, we first took the top 3,000 highly variable genes using the
   variance stabilizing transform method,[346]^83 including the
   differentially perturbed genes. We built two networks for each dataset,
   one using all cells and the other using only cells with non-zero
   expression of the gene of interest.
Evaluation of perturbation predictions
   We built networks for each perturbed gene separately using SCING,
   GRNBOOST2, ppcor, and PIDC. Starting from the perturbed gene of
   interest, at each depth in the network, we determined the TPR and FPR
   based on the perturbed genes computed above. This gives a TPR vs FPR
   graph, from which an AUROC was computed. For each perturbed gene, we
   calculated the AUROC and the TPR at a FPR of 0.05.
Network robustness evaluation based on training and held out testing data
   For each cell type in the Morabito et al dataset, we performed a train
   test split (50/50) of the cells and generated 10 sets of random
   subsamples of 3000 genes. We built a GRN on each gene subsample set in
   the training data and trained a gradient boosting regressor to predict
   the expression of each gene based on the gene expression of the
   predicted regulatory parents in the given GRN. We used the trained
   gradient boosting regressor to predict the expression of the 3,000
   genes in the test dataset and evaluated the performance based on the
   cosine similarity metric between the actual gene expression and
   predicted gene expression. We generated the cosine similarity values on
   the training and testing data separately and computed the test to train
   ratio of the cosine similarity, with a smaller test to train ratio
   indicating potential overfitting of the training data.
Computation of network characteristics
   We built GRNs on oligodendrocytes, astrocytes, and microglia from
   snRNAseq data from Morabito et al[347]^27 For each cell type, we
   randomly selected 3000 genes (reduce computational time of methods) for
   each sample and generated 10 GRNs, in which 3000 genes were randomly
   subsampled from the full transcriptome. To compute scale-free network
   characteristics for each network, we fit a linear regression model on
   the log of each node degree with the log of the proportion of nodes at
   each degree. We removed low degree data points that are an artifact of
   scRNAseq sparsity. We also characterized each network by the number of
   edges in the network, number of genes remaining in the resulting
   network, and the mean betweenness centrality of nodes across the
   network.
Computation of network overlap
   We used the Morabito et al. datasets described above and split each
   dataset in half and generated GRNs on each subset of cells. We checked
   the overlapping edges between the two networks and normalized for the
   expected number of overlapping edges based on the number of total edges
   in each network and the hypergeometric distribution. The overlap score
   measured the fraction of overlap between the two networks, divided by
   the expected number of overlapping edges.
Assessment of disease subnetwork retrieval of GRNs
   We utilized a random walk approach from Huang et al.[348]^35 to
   determine the ability for GRNs from different methods to accurately
   model disease gene subnetworks. This approach provides a biologically
   relevant benchmarking approach to determine a GRNs ability to model
   disease subnetworks. Briefly, the approach splits a known disease gene
   set into two groups, to attempt to reach the held out gene set starting
   from the selected disease genes through random walks. An improvement
   score is computed by calculating the z-score for a given network
   relative to 50 degree-preserved randomized networks.
   We built networks from the MCA on immune cells from bone marrow,
   neurons from the brain, and hepatocytes from the liver. To accommodate
   less efficient tools, we subsetted the transcriptome to genes that are
   expressed in more than 5% of the cells in the dataset. We used the
   method from Huang et al. using relevant immune, neuronal, and metabolic
   disease gene sets from DisGeNET. We kept these genes with >5% percent
   expression and included the genes from the disease gene sets. We
   determined performance of each subnetwork based on the improved
   performance compared to the random network distribution.
Application of SCING to construct GRNs for all MCA cell types and assessment
of network relevance to all DisGeNET disease gene sets
   We applied SCING to all cell types for all tissues in the MCA and
   utilized the approach from Huang et al. to determine the ability of
   each network to accurately model each disease gene set. We clustered
   the disease gene sets and cell types using hierarchical clustering with
   complete linkage. We determined the number of disease sets accurately
   modeled by each cell type based on a performance gain of at least 0.1.
   We subsequently computed the number of cell types that can accurately
   model each disease set. To compare the number of diseases modeled by
   cell types from the adaptive and innate immune system on tissue
   relevant subsets of the DisGeNET diseases, we performed a t-test
   between the distributions of the number of disease gene sets each cell
   type can accurately model.
Biological application to microglia in Alzheimer’s disease patients
   We built a SCING network for the microglia on the genes expressed in at
   least 2.5 percent of cells in the Morabito et al. dataset.[349]^27 For
   the SCING pipeline, we used 500 supercells, 70 percent of cells in each
   subsample, 100 neighbors, 10 PCs, and 100 subsamples. We utilized the
   Leiden graph partitioning algorithm to divide genes in the resulting
   GRNs into modules. We performed Leiden clustering at different
   resolutions and performed pathway enrichment analysis on the modules
   using the enrichr[350]^78 R package, using the GO biological process,
   DisGeNET, Reactome, BioCarta, and KEGG knowledge bases. We selected the
   resolution (0.0011) that had the highest fraction of modules annotated
   for between 20 and 50 modules per network. This avoids clustering too
   many modules with few genes while maintaining enough separate modules
   to have biological interpretation. We used the AUCell method from the
   SCENIC workflow,[351]^19 to retrieve module specific expressions
   (AUCell scores) for each cell. We found trait (diagnosis, plaque stage,
   tangle stage) associated modules by fitting a linear regression model
   to predict the trait based on the module score, while regressing out
   the effects of sex. For each trait, multiple testing was controlled at
   FDR <0.05 with the Benjamini-Hochberg procedure. The subnetwork for
   vesicle-mediated transport in module 2 was visualized using
   Cytoscape.[352]^84 We determined marker genes using the Allen Brain
   Atlas whole brain Smartseq2 data.[353]^85
Batch correction comparison
   We compared top batch correction methods from Seurat, Harmony, and
   fastMNN with SCING module embeddings. To evaluate each method, we
   determined the average proportion of cells with the same group
   assignment (sample, batch, diagnosis, tangle stage, plaque stage, and
   sex), using 20 PCs and a variable number of neighbors (0.25, 0.5, 1, 2,
   4, 8, and 16 percent of the dataset) ([354]Equation 1). We determined
   the ability of each approach to remove batch and sample specific
   differences while retaining biologically relevant differences
   (diagnosis, tangle stage, plaque stage, and sex) by removing the batch
   and sample differences with an F1-score[355]^44 ([356]Equation 2).
   [MATH: neigh_score=1nc
   ell
   ∑cellssimil
   mi>ar_neighborsn<
   /mi>_neighb
   mi>ors) :MATH]
   Equation 1
   neigh_score: neighborhood score used to find the average fraction of
   neighbors of the same type (i.e. batch).
   similar_neighbors: number of neighbors of a given cell that have the
   same identity (i.e. batch)
   n_neighbors: number of total neighbors checked
   n[cell]: total number of single cells
   [MATH:
   F1ph
   enotype<
   /mrow>=2∗(1−neigh_scoresam<
   mi>ple)∗(1−neigh_scorebat<
   mi>ch)∗(neigh_scorephe<
   mi>notype)(1−neigh_scoresam<
   mi>ple)+(1−neigh_scorebat<
   mi>ch)+(neigh_scorephe<
   mi>notype) :MATH]
   Equation 2
   neigh_score: neighborhood score computed in [357]Equation 1 for a given
   identity.
Application of SCING to visium spatial transcriptomics data for mouse AD and
WT brain
   To determine the applicability and interpretability of SCING to spatial
   transcriptomics data, we applied SCING to mouse whole brain AD and WT
   data.[358]^28 Since the network was built on the whole brain rather
   than a single cell type, we expect more variance amongst networks from
   subsamples, therefore we built 1,000 GRNs to be merged into the final
   network. We partitioned the genes with the Leiden graph
   partitioning algorithm into 33 modules. Using AUCell from
   SCENIC,[359]^19 we obtained module specific expression for each spot.
   We determined regional specificity between pairs of larger regions
   (cortex, hippocampus, brainstem) through t-tests and overall variance
   for the smaller subregions through ANOVA. We determine differential
   module expression between AD and WT through t-tests, and correlation
   with age or plaque with Pearson correlation. The null distribution of
   Pearson correlation coefficients was generated by randomly sampling
   genes with the same number of genes in the module, computing the AUCell
   scores for the random gene sets, and computing Pearson correlation
   between the AUCell scores and the plaque or age. The null distribution
   of correlation coefficients was used to determine the p value for each
   module’s correlation coefficient. Finally, the module 9, 10, 25, and 30
   subnetworks were visualized using Cytoscape[360]^84 and annotated cell
   type marker genes using the Allen Brain Atlas whole brain Smartseq2
   data.[361]^85
Acknowledgments