Abstract

Introduction

   Mathematical models are powerful tools that can be used to advance our
   understanding of complex diseases. Autoimmune disorders such as
   systemic lupus erythematosus (SLE) are highly heterogeneous and require
   high-resolution mechanistic approaches. In this work, we present
   ONIDsc, a single-cell regulatory network inference model designed to
   elucidate immune-related disease mechanisms in SLE.

Methods

   ONIDsc enhances SINGE’s Generalized Lasso Granger (GLG) causality model
   used in Single-cell Inference of Networks using Granger ensembles
   (SINGE) by finding the optimal lambda penalty with cyclical coordinate
   descent. We benchmarked ONIDsc against existing models and found it
   consistently outperforms SINGE and other methods when gold standards
   are generated from chromatin immunoprecipitation sequencing (ChIP-seq)
   and ChIP-chip experiments. We then applied ONIDsc to three large-scale
   datasets, one from control patients and the two from SLE patients, to
   reconstruct networks common to different immune cell types.

Results

   ONIDsc identified four gene transcripts: matrix remodelling-associated
   protein 8 (MXRA8), nicotinamide adenine dinucleotide kinase (NADK), RNA
   Polymerase III Subunit GL (POLR3GL) and Ultrabithorax Domain Protein 11
   (UBXN11) in CD4+ T-lymphocytes, CD8+ Regulatory T-Lymphocytes, CD8+
   T-lymphocytes 1 and Low Density Granulocytes that were present in SLE
   patients but absent in controls.

Discussion

   These genes were significantly related to nicotinate metabolism,
   ribonucleic acid (RNA) transcription, protein phosphorylation and the
   Rho family GTPase (RND) 1-3 signaling pathways, previously associated
   with immune regulation. Our results highlight ONIDsc’s potential as a
   powerful tool for dissecting physiological and pathological processes
   in immune cells using high-dimensional single-cell data.

   Keywords: network inference, Systemic lupus erythematosus (SLE),
   single-cell, mathematical modeling, gene marker

Introduction

   Autoimmune diseases are highly heterogeneous conditions, characterized
   by immune cell dysregulation that plays a central role in pathogenesis
   ([35]1). Despite their heterogeneity, they share genetic risks,
   molecular pathway dysregulations, and clinical manifestations ([36]2).
   Systemic lupus erythematosus (SLE) is one such disease with a
   prevalence of 43.7 per 100,000 people ([37]3). Longitudinal studies
   have shown that the interferon (IFN) signature, a set of genes
   overexpressed upon stimulation of IFN receptors, is associated with
   clinical phenotypes, disease activity, and accrual damage. These
   studies often use the SLE Disease Activity Index (SLEDAI) to stratify
   patient severity ([38]4–[39]6), and have also linked the IFN signature
   to the activation states of various immune cell types ([40]7–[41]10).
   Multiple immune cell types in SLE patients exhibit elevated IFN
   signatures, contributing to tissue inflammation and autoantibody
   production ([42]11, [43]12). The role of T lymphocytes is more
   controversial, likely due to different roles associated with different
   T lymphocytes subtypes. CD8+ T cells (CD8TCs) and how IFN signatures
   contribute to autoantibody production and inflammation, but they also
   exhibit features of exhaustion, which increases infection risk and
   promotes autoimmunity ([44]13, [45]14). Furthermore, a reduction in
   CD8+ regulatory T cell (CD8TREG) subpopulation can result in increased
   CD4+ T cell (CD4TCs) driving SLE pathogenesis ([46]14). Type I IFN
   genes activate monocytes (MONOs), B and T lymphocytes ([47]15, [48]16),
   while Type II IFNs have paradoxical effects on macrophages and
   dendritic cells, activating some phenotypes while suppressing others
   ([49]15). In addition, Type II IFNs promote B cell activation and
   plasma cell (PC) differentiation in SLE ([50]17). To unravel the
   complexity of immune-related disease mechanisms, high-resolution
   mechanistic approaches are required ([51]18). With this perspective,
   projects like 3TR (taxonomy, treatment, targets and remission) have
   been designed to better understand the mechanisms of response and
   non-response to therapy of autoimmune, inflammatory and allergic
   diseases ([52]19). By investigating shared features across patients and
   diseases, these efforts may uncover common pathways involved in
   treatment response and disease progression ([53]20). Typically, bulk
   ribonucleic acid (RNA) sequencing has been used to measure average gene
   expression across samples. However, averaging transcriptional
   information from heterogeneous cell populations can obscure important
   biological variation. Recent advances in single-cell transcriptomics
   now allow the study of gene regulatory networks at cellular resolution,
   offering a more detailed view of immune cell behavior ([54]21, [55]22).

   Mathematical models are powerful tools for generating and testing
   hypotheses about network structure and function helping to advance our
   understanding of complex diseases. Machine learning (ML) is making
   significant strides in SLE research and diagnosis, offering potential
   advancements in disease understanding, prediction, and treatment. ML
   algorithms can analyze complex datasets to identify patterns and
   predict disease activity, flare risks and treatment responses,
   ultimately aiding in early diagnosis and personalized patient care.
   While some ML approaches are based on radiologic imaging, others
   leverage multiomic data sources ([56]23–[57]25). Several network
   inference algorithms have been developed to reconstruct regulatory
   networks from bulk, such as Gene Network Inference with Ensemble of
   trees (GENIE3) ([58]26), and single-cell RNA sequencing data
   ([59]27–[60]30). Some of these methods, including Single-cell Inference
   of Networks using Granger ensembles (SINGE) ([61]30), Singular
   Component Detection and Optimization (SCODE) ([62]31) and Single-Cell
   Regularized Inference Using Time-Stamped Expression Profiles
   (SINCERITIES) ([63]32) rely in pseudotime, while others, like Jump Tree
   (JUMP3) use real-time points ([64]33). The performance of these network
   inference algorithms has been benchmarked in different publications
   ([65]30, [66]34, [67]35), with GENIE3 consistently ranking among the
   top-performing methods, as shown by McCalla and colleagues. In these
   comparisons, SINGE was not evaluated ([68]34). Deshpande and colleagues
   later showed that SINGE performed favorably in terms of precision and
   recall to the other four existing methods, including GENIE3, when
   evaluated using chromatin immunoprecipitation sequencing (ChIP-seq),
   ChIP-chip, loss of function (LoF), and gain of function (GoF) data
   ([69]30). SINGE is a kernel-based generalization of the Generalized
   Lasso Granger (GLG) causality model that infers regulatory networks
   from pseudotime-ordered single cells. It depends on a large set of
   hyperparameters. Rather than optimizing them, SINGE aggregates results
   from a reduced set of hyperparameter combinations to improve
   robustness. Although this ensemble approach performs better than most
   individual configurations, it does not achieve the best average
   precision. The lambda penalty hyperparameter has the greatest impact on
   performance. Moreover, while ensemble aggregation is suitable for small
   datasets (<3100 genes), it increases runtime, requires substantial
   computational resources, and is not scalable to larger datasets. In
   fact, most benchmarking studies have used relatively small datasets
   (<2000 and <8000 genes, respectively) ([70]34, [71]35); whereas
   real-world patient datasets often contain >20,000 genes.

   Finding the optimal set of hyperparameters is essential for generating
   meaningful networks. Hyperparameters directly influence model
   performance, affecting metrics such as accuracy, precision, recall,
   stability, scalability, and runtime ([72]30, [73]34, [74]35). Proper
   hyperparameter tuning also helps balance model complexity, avoid over-
   or underfitting and allow adapt the model to different datasets. The
   choice of optimizer depends on the nature of the problem. While simple,
   differentiable regression problems may be solved using gradient descent
   ([75]36), more complex, non-differentiable Lasso regression problems
   require advanced methods such as cyclical coordinate descent. This
   technique iteratively adjusts one hyperparameter at a time, optimizing
   the objective function while fixing the rest of the hyperparameters.
   This approach is computationally efficient and particularly effective
   for large sparse single-cell datasets, as it breaks down the complex,
   multivariable optimization problem into a simpler, one-dimensional
   step-by-step process ([76]37, [77]38).

   In this study, we introduce ONIDsc, a pseudotemporal-based single-cell
   network inference algorithm that enhances SINGE’s GLG causality model
   by finding the optimal lambda penalty with cyclical coordinate descent.
   We compare ONIDsc network inference to other existing methods and find
   it consistently outperforms SINGE ensemble. Furthermore, ONIDsc
   outperformed all methods when gold standards were generated from
   ChIP-chip and ChIP-seq experiments. We then apply ONIDsc to one control
   and two SLE patient datasets to reconstruct networks common to
   different immune cell types. We proposed four genes likely involved in
   SLE pathogenesis. In summary, we demonstrate that combining advanced
   optimization with state-of-the-art single-cell network inference
   enables scalable analysis of large patient datasets and provides new
   insights into immune-related disease mechanisms ([78] Figure 1 ).

Figure 1.

   [79]Diagram explaining the ONIDsc model using cyclical coordinate
   descent. Top right shows a spiral convergence pattern, leading to
   “Optimal Lambda” and the GLG Causality model. Various colored genes
   (HLA-DRA, TMSBAX, ISG15, S100A8) are indicated. Graphs show gene
   expression versus pseudotime, and a scatter plot illustrates model
   boundaries with positive and negative outcomes. Listed benefits include
   reduced mean squared error, improved accuracy, precision, recall,
   reduced runtime, and increased scalability.
   [80]Open in a new tab

   Overview of ONIDsc: a scalable regulatory network inference framework
   using optimized GLG causality. ONIDsc uses pseudotemporally ordered
   single-cell gene expression data to infer regulator-target gene
   networks based on SINGE’s GLG causality test. Unlike SINGE, which uses
   five fixed lambda hyperparameter values, ONIDsc identifies a single
   optimal lambda using cyclical coordinate descent. This optimization
   improves performance by reducing MSE, increasing (early) accuracy,
   precision, and recall, reducing runtime, and enhancing scalability,
   making ONIDsc suitable for large patient datasets.

Materials and methods

Pre-processing

   Three scRNA-seq different datasets were analyzed including one with
   five idiopathic intracranial hypertension (IIH) patients as controls
   ([81]39), one with three SLE patients ([82]7) and one with fifty-six
   SLE patients ([83]11) ([84] Figure 2A ). Pathological characteristics
   were examined in IIH patients showing oligoclonal bands (proteins
   indicating inflammation of the central nervous system) were not
   detected, CSF and blood were unaffected, and there was no other
   diagnosed disease in the controls ([85] Supplementary Figure 1 of the
   referred paper ([86]39)). Thus, they were considered appropriate
   controls.

Figure 2.

   [87]Flowchart depicting a study on gene network inference using
   single-cell RNA sequencing. Section A shows a dataset of patients.
   Section B describes pre-processing, including alignment and filtering
   of FASTQ files. Section C outlines downstream analysis: 1. Cell type
   selection via t-SNE plot; 2. Pseudotime inference with scatter plot; 3.
   Lasso hyperparameter optimization with coefficient and MSE graphs; 4.
   Gene network inference using a GLG causality model and adjacency
   matrix. Histograms show frequency distributions for lambda values.
   [88]Open in a new tab

   ONIDsc pipeline: from subset selection and preprocessing to network
   inference. Diagram of the ONIDsc pipeline summarizing (A) the subsets
   analyzed, (B) preprocessing steps including normalization and
   pseudotime ordering, and (C) the downstream analysis stages leading to
   GLG causality-based network inference. The downstream analysis is
   divided into four main components.

   All datasets came from peripheral blood mononuclear cells (PBMCs).
   First, fastq files from sequenced single cells were downloaded from
   public repertoires ([89] Figure 2B ). The Schafflick and colleagues
   ([90]39) dataset was downloaded from the Gene Expression Omnibus (GEO)
   repository using accession number: [91]GSE141797. The Mistry and
   colleagues ([92]7) dataset was downloaded from the GEO repository using
   accession number: [93]GSE142016. The Nehar-Belaid and colleagues
   ([94]11) dataset was downloaded from the Genotypes and Phenotypes
   (dbGAP) database using accession number: phs002048.v1.p1. All datasets
   were pre-processed using the same protocol (For a detailed description,
   see [95]Supplementary Material 1 ).

Cell type selection and pseudotime

   Single cells from patient datasets were categorized into different
   immune-related cell types based on prior cell marker analysis ([96]7,
   [97]40, [98]41) ([99] Supplementary Table 1 ; [100]Figure 2C ). First
   the different CD8TC subtypes (CD8TC1s, CD8TC2s, CD8TC17s and CD8TREGs)
   were selected. Intersecting cells between each CD8TC subtype with the
   other three were removed to avoid redundancy in our analysis. Then
   CD4TCs, Low Density Granulocytes (LDGs) and MONOs were selected.
   Intersecting cells between MONOs and CD8TCs, CD4TCs and LDGs were
   removed from MONO cell types. Finally, B cell subtypes, PCs and memory
   B cells (MBCs), were selected. Intersecting cells between each of the B
   cell subtypes and CD8TCs, CD4TCs and LDGs were removed from both B cell
   subtypes. The level of subtyping was based on the relevance of the
   given cell population for the disease of interest. After cell typing,
   single cells within each of the subtypes were ordered along an immune
   cell activation pseudotemporal process represented by all type I and II
   IFN-related genes, including IFN alpha, IFN beta, IFN gamma, IFN
   lambda, IFN epsilon, IFN omega, IFN kappa, Interferon-stimulated Gene
   15 (ISG15) and Interferon-stimulated Gene 20 (ISG20) ([101] Figure 2C
   ). These genes were used as indicators of the cell activation state due
   to their established role in activating the studied cell types. The sum
   of the normalized gene expression of all the IFN-related genes was used
   as the cell order.

Lasso optimization

   The Lambda hyperparameter was optimized using cyclical coordinate
   descent. This is an optimization algorithm that operates by
   sequentially minimizing a multivariable function along one coordinate
   direction at a time. It begins with an initial guess for the variables
   and iteratively updates each variable sequentially. In each iteration,
   all variables except one are fixed, and the function is minimized with
   respect to the unfixed variable. This process is repeated, typically
   cycling through the coordinates in a systematic fashion, until the
   changes in the function value or the variables fall below a predefined
   threshold, indicating convergence.

   We used glmnet package ([102]42) to compute the regularization path for
   lambda. First, genes with less than four non-zero gene expression
   values were filtered out. During ten-fold cross-validation, glmnet
   requires the response variables to have at least five unique values.
   Genes with constant expression were also filtered out to avoid glmnet
   failure during the standardization procedure due to their standard
   deviation being zero. The expression of each gene (response variable)
   and that of the rest of the genes (predictors) was modelled in the
   expression matrix. Glmnet function was used to fit a generalized linear
   model via penalized maximum likelihood ([103]Equation 1) solving the
   following minimization problem over the entire range of lambda values:
   [MATH:
   <mi>m</mi><mi>i</mi><mi>n</mi><mrow><msub><mi>β</mi><mn>0</mn></msub><m
   o>,</mo><mi>β</mi><mfrac><mn>1</mn><mi>N</mi></mfrac><munderover><mo>∑<
   /mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><
   mo> </mo><msub><mi>w</mi><mi>i</mi></msub><msub><mi>l</mi><mi>i</mi></m
   sub><mrow><mo
   stretchy="false">(</mo><mrow><msub><mi>y</mi><mi>i</mi></msub><mo>,</mo
   ><msub><mi>β</mi><mn>0</mn></msub><mo>+</mo><msup><mi>β</mi><mi>T</mi><
   /msup><msub><mi>x</mi><mi>i</mi></msub></mrow><mo
   stretchy="false">)</mo></mrow><mo>+</mo><mi>λ</mi><mrow><mo>[</mo><mrow
   ><mfrac><mrow><mn>1</mn><mo>−</mo><mi>α</mi></mrow><mn>2</mn></mfrac><m
   o>‖</mo><mi>β</mi><msubsup><mo>‖</mo><mrow><mn>2</mn></mrow><mn>2</mn><
   /msubsup><mo>+</mo><mi>α</mi><mo>‖</mo><mi>β</mi><msub><mo>‖</mo><mn>1<
   /mn></msub></mrow><mo>]</mo></mrow></mrow> :MATH]
   (1)

   Where l[i] (y[i],n[i] ) was the negative log-likelihood contribution
   for observation i. Alpha was a tuning parameter used to define the
   lasso problem by setting it to one. Lambda controlled the severity of
   the penalty.

   The above-mentioned minimization problem was solved using a Gaussian
   family ([104]Equation 2). Supposing there were observations χ [i] ∈ℝ ^p
   and responses y[i] ∈ℝ,i=l, ….N. The objective function for the Gaussian
   family was:
   [MATH: <mrow><mi>m</mi><mi>i</mi><mi>n</mi><mo
   stretchy="false">(</mo><msub><mi>β</mi><mn>0</mn></msub><mo>,</mo><mi>β
   </mi><mo
   stretchy="false">)</mo><mo>∈</mo><msup><mtext>ℝ</mtext><mrow><mi>p</mi>
   <mo>+</mo><mn>1</mn></mrow></msup><mfrac><mn>1</mn><mrow><mn>2</mn><mi>
   N</mi></mrow></mfrac><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mi>N</mi></munderover><msup><mrow><mrow><mo> </mo><mo
   stretchy="false">(</mo><mrow><msub><mi>y</mi><mi>i</mi></msub><mo>−</mo
   ><msub><mi>β</mi><mn>0</mn></msub><mo>−</mo><msubsup><mi>x</mi><mi>i</m
   i><mi>T</mi></msubsup><mi>β</mi></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mn>2</mn></msup><mo>+</mo><mi>λ</
   mi><mrow><mo>[</mo><mrow><mfrac><mrow><mn>1</mn><mo>−</mo><mi>α</mi></m
   row><mn>2</mn></mfrac><mo>‖</mo><msubsup><mo>‖</mo><mrow><mn>2</mn></mr
   ow><mn>2</mn></msubsup><mo>+</mo><mi>α</mi><mo>‖</mo><mi>β</mi><msub><m
   o>‖</mo><mn>1</mn></msub></mrow><mo>]</mo></mrow></mrow> :MATH]
   (2)

   Cyclical coordinate descent was then applied to solve the problem
   ([105]Equation 3). Supposing current estimates
   [MATH: <mrow><msub><mover
   accent="true"><mi>β</mi><mo>∼</mo></mover><mn>0</mn></msub></mrow>
   :MATH]
   and
   [MATH: <mrow><msub><mover
   accent="true"><mi>β</mi><mo>∼</mo></mover><mi>l</mi></msub></mrow>
   :MATH]
   ∀ℓ∈1, …,p. By computing the gradient at βj=
   [MATH: <mrow><msub><mover
   accent="true"><mi>β</mi><mo>∼</mo></mover><mi>j</mi></msub></mrow>
   :MATH]
   and using simple calculus, the problem was updated to:
   [MATH: <mrow><msub><mover
   accent="true"><mi>β</mi><mo>∼</mo></mover><mi>j</mi></msub><mi>←</mi><m
   frac><mrow><mi>S</mi><mrow><mo
   stretchy="false">(</mo><mrow><mfrac><mn>1</mn><mi>N</mi></mfrac><msubsu
   p><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msub
   sup><mtext> </mtext><msub><mi>x</mi><mrow><mi>i</mi><mi>j</mi></mrow></
   msub><mrow><mo>(</mo><mrow><msub><mi>y</mi><mi>i</mi></msub><mo>−</mo><
   mrow><mo stretchy="false">(</mo><mrow><msub><mover
   accent="true"><mi>β</mi><mo>∼</mo></mover><mn>0</mn></msub><mo>+</mo><m
   subsup><mo>∑</mo><mrow><mi>l</mi><mo>≠</mo></mrow><mi>j</mi></msubsup><
   mo> </mo><msub><mi>x</mi><mrow><mi>i</mi><mi>l</mi></mrow></msub><msub>
   <mover
   accent="true"><mi>β</mi><mo>∼</mo></mover><mi>l</mi></msub></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mo>)</mo></mrow><mo>,</mo><mi>λ</
   mi><mi>α</mi></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mrow><mn>1</mn><mo>+</mo><mi>λ</m
   i><mrow><mo
   stretchy="false">(</mo><mrow><mn>1</mn><mo>−</mo><mi>α</mi></mrow><mo
   stretchy="false">)</mo></mrow></mrow></mfrac></mrow> :MATH]
   (3)

   where S(z, y) was the soft-thresholding operator with value sign
   (z)(|z|−y)[+].

   Cross-validation is a crucial component of the Lasso regression fitting
   process. We used cv.glmnet function to apply k-fold cross-validation in
   order to find an optimal minimum and maximum lambda for each gene
   ([106] Figure 2C ). We used the default number of folds (nfolds = 10),
   adequate for large datasets. This step involved dividing the data into
   training and validation sets multiple times to ensure that the chosen
   lambda minimized the mean squared error (MSE) in a robust and
   generalizable manner. The distribution of minimum and maximum lambdas
   showed a severe left-skewed distribution. Thus, the most frequent
   minimum and maximum for all genes were computed. We analyzed the
   distribution of minimum and maximum lambdas and the MSE per gene of
   different lambda values within the optimal range for several datasets.

GLG network inference

   Ordered single-cell RNA-seq data were taken as input and analyzed using
   multiple GLG instances with different hyperparameters. Each
   hyperparameter set resulted in an adjacency matrix (gene by gene
   matrix) and each element in the adjacency matrix represented the weight
   of the relation between a gene pair ([107] Figure 2C ). Genes could be
   regulators and/or targets. Non-zero elements were then ranked and
   aggregated using a modified borda count, with an optional subsampling
   stage increasing the effective ensemble size. The main hyperparameters
   included the lambda, the number of replicas, the time resolution, the
   extent of the lagged time series, the kernel width, the zero handling
   and the probability of zero removal. Default values specified in SINGE
   were used for all hyperparameters except for the probability of zero
   removal, lambda and the number of replicas. We have used a probability
   of zero removal of 0.2 in all our benchmarking and patient dataset
   analysis. Thus, for each gene, each zero-valued sample and its
   corresponding pseudotime were removed with a probability of 0.2. This
   was done to mitigate the dropout severity of the patient samples.
   Interestingly, the effect of changing the zero-removal probability was
   previously studied, showing that the average early precision and
   average precision were marginally affected. Cyclical coordinate descent
   was used to find a single lambda hyperparameter value within the
   optimal range. Eight technical replicas per hyperparameter combination
   were analyzed to reduce resources consumed, since it was found that
   from five replicas on, the average early precision plateaued ([108]30).

Benchmarking

   Two small-sized datasets and their corresponding gold standards were
   used for benchmarking purposes. The first dataset was the embryonic
   stem cell (ESC) to endoderm differentiation dataset from Hayashi and
   colleagues ([109]43) with 356 cells and 100 genes. As a gold standard,
   652 known interactions were taken from the embryonic stem cell atlas
   from the pluripotency evidence ESCAPE database ([110]44). To generate
   the gold standard, 95% of ChIP-chip/ChIP-seq and 5% LoF and GoF
   interactions were combined. The second dataset was the retinoic acid
   dataset from Semrau and colleagues ([111]45) with 1886 cells and 626
   genes. As a gold standard, 1862 known interactions were taken from the
   embryonic stem cell atlas from the pluripotency evidence ESCAPE
   database ([112]44). To generate the gold standard, 52% of
   ChIP-chip/ChIP-seq and 48% LoF and GoF interactions were combined. To
   ensure the reliability of the gold standard used for benchmarking, we
   employed the ESCAPE database, one of the most comprehensive
   repositories of experimentally validated interactions in mouse
   embryonic stem cells (ESCs). ESCAPE integrates data from multiple
   experimental sources, including ChIP-seq, RNAi, and protein-protein
   interaction studies. Interactions not documented in ESCAPE were assumed
   to be absent.

   We benchmarked ONIDsc against five other existing regulatory network
   inference methods: SINGE ([113]30), SINCERITIES ([114]32), SCODE
   ([115]31), JUMP3 ([116]33), and GENIE3 ([117]26). Default settings were
   used to run all methods. SINGE’s ensemble outputs was run four times
   using the same default hyperparameters for both benchmarking datasets:
   10 replicas, [0,0.01,0.02,0.05,0.1] lambdas, [ (3,5) (5,9) (9,5)
   (5,15); (15,5)] time resolution and number of lag combinations, [0.5;
   1; 2; 4] kernel widths, 0 zero handling and the 0.2 probability of zero
   removal. Additionally, we run SINGE with one of the benchmarking
   datasets using default values except for the lambda parameter, which
   was set to [0], [0.02], [0.05], [0.1] individually to assess the
   performance of each single lambda value ([118] Figure 3 ). SINGE was
   always used in combination with Monocle for pseudotime inference
   ([119]28). ONIDsc was also run four times on both benchmarking datasets
   using the following hyperparameters: 10 replicas, [ (3,5) (5,9) (9,5)
   (5,15); (15,5)] time resolution and number of lag combinations, [0.5;
   1; 2; 4] kernel widths, 0 zero handling and the 0.2 probability of zero
   removal. The embryonic stem cell (ESC) to endoderm differentiation
   dataset was run using [0.01] optimal lambda ([120] Figure 3 ) while the
   retinoic acid dataset was run using [0.1] optimal lambda ([121]
   Supplementary Figure 1 ). The Lambda parameter value used to run ONIDsc
   was within the optimal range. GENIE3, originally developed for bulk
   transcriptomics, was applied without cell ordering or pseudotime. JUMP3
   used cell ordering information, while the remaining methods relied on
   pseudotime. SCODE was run with four degrees of freedom for the ESC to
   endoderm dataset and twenty degrees of freedom for the retinoic acid
   dataset to account for the larger gene network size.

Figure 3.

   [122]A series of graphs from a scientific analysis: A. Two histograms
   showing the frequency of Lambda values, highlighting maximum and
   minimum values. B. Error bar plot of Mean Squared Error (MSE) against
   Lambda. C. Line plots of pseudotime with Monocle and ONIDsc, showing a
   cell order match of 0.4%. D. Scatter plot comparing accuracy and recall
   for different methods, with a legend indicating each method. E. Scatter
   plot of accuracy versus runtime for various methods. F. Scatter plot of
   precision versus recall, highlighting specific data points. G. Scatter
   plot of early precision versus early recall. H. Scatter plot of early
   accuracy versus early recall. Each data point is color-coded according
   to the method used, as shown in the legend.
   [123]Open in a new tab

   Benchmarking ONIDsc against regulatory network and pseudotime methods
   on ESC-to-endoderm differentiation data. (A) Minimum (red) and maximum
   (green) optimal lambdas. (B) MSE distribution over lambdas proposed by
   SINGE and ONIDsc. (C) Comparison of pseudotime inference between ONIDsc
   and Monocle. (D) Accuracy over recall, (E) accuracy over run time, (F)
   precision over recall, (G) early precision over early recall and (H)
   early accuracy over early recall. Six regulatory network inference
   algorithms were tested: GENIE3 (light blue), JUMP3 (dark blue), ONIDsc
   (purple), ONIDsc’s network inference combined with Monocle’s pseudotime
   (blue), SCODE (fuchsia), SINCERITIES (dark green) and SINGE combined
   with Monocle’s pseudotime (brown). SINGE was also run with each of the
   lambda hyperparameter values individually: L0 (orange), L0p02 (light
   green), L0p05 (red), and L0p1 (black). Four replicas were run for
   GENIE3, ONIDsc and SINGE. The standard deviation is represented with
   bars. T-tests comparing the average precision of ONIDsc, with GENIE3
   and SINGE were performed. Statistically significant differences (P
   value< 0.05) were represented with an asterisk and the p value is
   indicated above. Both asterisks and p-value are colored by the method
   ONIDsc was compared with. Methods are also represented with different
   shapes as indicated in the figure legend.

   We also benchmarked ONIDsc against another pseudotime inference method,
   Monocle ([124]28), which was combined with SINGE’s and ONIDsc’s network
   inference to order single cells along a pseudotime ([125]46). ONIDsc
   pseudotemporal concept was applied to both benchmarking datasets by
   ordering the cells based on the sum of the log expression of the genes
   known to drive the differentiation process in each of them. For the
   embryonic stem cell (ESC) to endoderm differentiation dataset, the
   following gene markers were used: Undifferentiated Embryonic Cell
   Transcription Factor 1 (UTF1), Myelocytomatosis Oncogene (MYC),
   Krüppel-like Factor 4 (KLF4), Sex Determining Region Y-Box 2 (SOX2),
   and Octamer-Binding Transcription Factor 4 (OCT4), based on previous
   studies demonstrating their involvement in ESC differentiation
   ([126]47). For the retinoic acid differentiation dataset, the following
   pluripotency-associated markers were used: Zinc Finger Protein 42
   (ZFP42), POU Class 5 Homeobox 1 (POU5F1), KLF4, Developmental
   Pluripotency Associated 5A (DPPA5A), Estrogen-Related Receptor Beta
   (ESRRB), SOX2, DPPA3, Fibroblast Growth Factor 4 (FGF4), Krüppel-like
   Factor 2 (KLF2), and Transcription Factor CP2-Like 1 (TFCP2L1), based
   on prior findings linking them to pluripotency ([127]45).

   To evaluate the quality of regulatory network predictions, we used
   standard performance metrics including accuracy, precision, and recall.
   Accuracy measures the overall correctness of the model’s
   classifications. Precision assesses the proportion of true positives
   among predicted positives, penalizing false positives, while recall
   evaluates the model’s ability to identify all true positives,
   penalizing false negatives. To assess the top-ranked regulator–gene
   interactions—often the most biologically relevant in experimental
   settings—we also computed early accuracy, precision, and recall. These
   early metrics were calculated using partial recall thresholds of 0.1
   for the ESC to endoderm differentiation dataset and 0.3 for the
   retinoic acid dataset. The method runtime was evaluated for GENIE3,
   SINGE, and ONIDsc. To determine whether the evaluated methods performed
   better than random, we generated control networks by assigning random
   interactions between genes from each dataset. Method variability was
   assessed using four independent runs for GENIE3, SINGE, ONIDsc, and the
   random control.

Software and resources

   ONIDsc was implemented using a variety of softwares. Cell type
   selection and pseudotime inference algorithms were built in Matlab
   ([128]48) (version R2022a). The Lasso optimization algorithm was built
   in R ([129]49) (version 4.3.2) using various libraries: stringr
   (version 1.5.1), BiocManager (version 1.30.22), e1071 (version 1.7-14),
   ggplot2 (version 3.5.0), dplyr (version 1.1.4), glmnet (version 4.1-8),
   Matrix (version 1.6-5), ggtext (version 0.1.2), readr (version 2.1.5),
   pheatmap (version 1.0.12). Gene regulatory network inference algorithm
   SINGE (version 0.5.1) was provided by on Matlab-based Docker
   containers. Docker images were converted to singularity ([130]50)
   (version 2.4) to be run on two high-performance computing (HPC)
   clusters. GENIE3 was run in parallel using R packages: GENIE3 (version
   1.26.0), doParallel (version 1.0.17) and doRNG (version 1.8.6). Network
   visualization was done using Cytoscape ([131]51) (version 3.10.2) and
   pathway enrichment analysis (PEA) was performed using Reactome
   ([132]52) (version 88), gene ontology (GO; version 3.18.0) ([133]53)
   and R ([134]49) (version 4.3.2). Simulations were done on Sonic HPC
   cluster, for use by the UCD research community, and the Kay cluster of
   Ireland’s national HPC cluster. Each dataset was run using LongQ and
   large partitions on Kay and Sonic clusters. Serial batches were
   resourced with one node, eight central processing units (CPUs) and 150
   gigabites (GBs) of random-access memory (RAM). Runs were split over
   different serial batches. Jobs were parallelized between two and
   thirty-six times, depending on the size of the dataset which determined
   the run duration. ONIDsc source code is publicly available on GitHub
   ([135]https://github.com/ElenaMerinoTejero/ONIDsc-master.git). For a
   more detailed analysis of the resources used to analyze the Mistry and
   colleagues ([136]7) dataset with SINGE and ONIDsc algorithms, see
   [137]Supplementary Files 1 and [138]2 . For a detailed core and runtime
   analysis of the Nehar-Belaid and colleagues dataset ([139]11) run with
   ONIDsc see [140]Supplementary File 3 .

Results

ONIDsc overview and benchmarking

   We developed a network inference algorithm, ONIDsc, in order to enhance
   SINGE’s GLG causality model which was previously described by Deshpande
   and colleagues ([141]30). ONIDsc is based on SINGE and uses
   pseudotemporally ordered single-cell gene expression data to predict a
   regulator-target network that underlies a biological process. The GLG
   causality model uses a statistical hypothesis, Granger Causality, which
   tests the predictive causality between a regulator and its target
   within a time series. Instead of real-time series, the relative
   position of the cells within a trajectory or pseudotime representing a
   given biological process was inferred. We avoided using common
   pseudotime methods that use all genes from the dataset, in an
   unsupervised manner, to derive the cell order. Instead, we first
   performed a Log X+1 normalization, to reduce the data skewness and then
   ordered single cells within each cell type category based on the sum of
   the normalized expression of IFN-related genes ([142] Figure 2C ).
   IFN-related genes represent the cell activation process. These genes
   are expressed during immune cell activation ([143]54) and have also
   been found to correlate with disease severity ([144]4). This approach
   ensured that the cell order specifically reflected the biological
   process under investigation. Granger Causality combined with lasso
   regression allows for adjusting model complexity by performing variable
   selection and regularization to avoid under- or overfitting. The degree
   to which variables are selected depends on the lambda hyperparameter.
   ONIDsc introduces the use of cyclical coordinate descent to identify
   the optimal lambda, which allows the model to adjust the degree of
   variable selection to each dataset, thereby enhancing performance.

   To assess the ability of ONIDsc to improve other existing
   state-of-the-art methods, we performed benchmarking using two different
   datasets. We benchmarked the full ONIDsc method, based on a prior
   marker-based pseudotemporal ordering concept combined with optimal GLG
   causality network inference against five regulatory network inference
   methods and one pseudotime ordering method. The five regulatory network
   inference methods included SINGE ([145]30), based on GLG causality
   using five different lambdas, SINCERITIES ([146]32), based on ridge
   regression Granger causality, SCODE ([147]31), based on ordinary
   differential equations, JUMP3 ([148]33), based on decision trees and
   GENIE3 ([149]26), a tree-based method. The pseudotime ordering method
   was Monocle ([150]28), an established method based on an unsupervised
   reversed graph embedding approach paired with clustering techniques.

   First, we applied ONIDsc to the ESC to endoderm differentiation dataset
   with 95% of the interactions coming from ChIP-chip and ChIP-seq, and 5%
   from LoF and GoF data. We found the optimal minimum and maximum lambda
   per gene using a cyclical coordinate descent algorithm. The most
   frequent minimum and maximum lambdas were 0.004 and 0.01, respectively
   ([151] Figure 3A ). Lambdas outside the optimal range resulted in
   increased MSE ([152] Figure 3B ). ONIDsc and Monocle pseudotime methods
   were compared ([153] Figure 3C ) and combined with ONIDsc network
   inference method to assess the contribution of each part of ONIDsc to
   the overall method performance. We assessed performance in terms of
   accuracy, precision, recall, and runtime, focusing on early or
   top-ranked regulator-gene interactions ([154] Figures 3D–H ). The full
   ONIDsc method, combining its own pseudotime inference and network
   inference, was the best-performing method in five out of seven
   performance metrics, namely precision, accuracy, early precision, early
   recall, and early accuracy. For the remaining metrics, in terms of
   recall, the ONIDsc full method was outperformed by GENIE3, JUMP3, SINGE
   ensemble, and ONIDsc network inference when combined with Monocle’s
   pseudotemporal order. SINGE’s ensemble outputs and ONIDsc combined with
   Monocle were tied in recall. In terms of runtime, the ONIDsc full
   method was the second fastest after GENIE3, while significantly
   improving upon SINGE’s ensemble outputs. Overall, ONIDsc full method
   was the best-performing method for this dataset. SCODE performed worse
   than random across all metrics. When examining the performance of SINGE
   run with suboptimal lambdas individually, we found ONIDsc full method
   to outperform all individual lambdas across all metrics. Lambda 0.1
   performed similarly to or worse than SINGE’s ensemble and random across
   all metrics. Lambda 0 also performed worse than SINGE’s ensemble and
   random in the three early metrics and had a runtime nearly as long as
   SINGE’s ensemble. The fact that certain suboptimal lambdas showed
   increased MSE along with drastically reduced performance may explain
   why ONIDsc outperformed both SINGE’s ensemble and SINGE with suboptimal
   lambdas in six out of seven performance metrics: precision, accuracy,
   runtime, early precision, early recall, and early accuracy.

   We then applied ONIDsc to the retinoic acid dataset, with 52% of the
   interactions coming from ChIP-chip and ChIP-seq and 48% from LoF and
   GoF data. Most frequent minimum and maximum lambdas were 0.03 and 0.2
   ([155] Supplementary Figure 1A ). Lambdas outside the optimal range
   resulted in increased MSE ([156] Supplementary Figure 1B ). Performance
   assessment revealed JUMP3 as the best method in five out of seven
   performance metrics: precision, accuracy, early precision, early
   recall, and early accuracy. In terms of recall, JUMP3 and SINCERITIES
   were tied as the best methods. GENIE3 was the fastest method ([157]
   Supplementary Figures 1C–G ). The full ONIDsc method ranked fifth in
   six out of seven performance metrics, namely precision, recall,
   accuracy, early precision, early recall, and early accuracy.
   Interestingly, we observed a slight performance improvement when
   combining ONIDsc network inference with Monocle’s pseudotemporal
   ordering. This combination ranked fourth in six out of seven
   performance metrics. ONIDsc outperformed SINGE’s ensemble outputs
   across all metrics. SCODE again performed worse than random.

   Finally, we explored the cellular topology of both benchmarking
   datasets using t-distributed stochastic neighbor embedding (t-SNE)
   dimensionality reduction ([158] Supplementary Figures 2A, B ). This
   allowed us to better understand the underlying patterns and
   relationships in each dataset. The ESC to endoderm differentiation
   dataset displayed a more linear or single-path topology ([159]
   Supplementary Figure 2A ), while the retinoic acid dataset exhibited a
   tree-like, bifurcated, or multi-path topology ([160] Supplementary
   Figure 2B ). This difference in topology may explain why ONIDsc’s
   pseudotime inference outperformed Monocle in the ESC dataset but
   underperformed in the retinoic acid dataset.

   Altogether, these results show that cyclical coordinate descent can
   enhance the GLG causality test by identifying the optimal degree of
   variable selection, resulting in reduced MSE and improved (early)
   precision, accuracy, and recall. They also suggest that ONIDsc may be
   better suited for identifying interactions from ChIP-chip and ChIP-seq
   experiments, which are primarily direct (causal), than from gene
   expression data (LoF and GoF), which may include indirect
   (correlational) interactions. Finally, ONIDsc’s pseudotime inference
   method appears more effective for single-cell datasets with linear or
   single-path topologies.

Case studies

Five IIH patient dataset: network inference of controls using ONIDsc

   As the next step, we applied ONIDsc to the dataset from Schafflick and
   colleagues ([161]39). This analysis aimed to identify baseline networks
   inferred by ONIDsc in a normally functioning immune system. Seven
   different immune cell types were analyzed for each of the five
   controls, resulting in a total of thirty-five cell-type subsets. We
   found between 3 and 406 cells per cell type, and between 269 and 20,984
   genes ([162] Supplementary Figure 3 ). We investigated the shared
   networks across different control subsets. Networks consisted of a
   ranked list of scored regulator-target gene relations. We performed
   clustering of the different cell types based on all relations ([163]
   Supplementary Figure 4A ). We found one shared relation in CD8TC2s
   (cluster 7) and 21 shared relations in MBCs (cluster 4) common to two
   out of the five controls. This finding illustrates the high
   heterogeneity of regulatory relationships in the immune system, likely
   due to the stochastic nature of immune processes at various levels
   ([164]55). Clusters 1, 3, and 4 contained the most common relations
   across two of the seven cell types studied ([165] Supplementary
   Figure 4B ). We analyzed all nodes from the most common relations using
   pathway enrichment analysis (PEA) ([166] Supplementary Figure 4C ).
   Five out of the thirty-five genes were significantly associated with
   pathways including fatty acid and vesicle synthesis, cell cycle,
   mitochondrial RNA processing, and tumor necrosis factor receptor
   superfamily (TNFRSF) binding involved in apoptosis and inflammation.
   Importantly, to discard that the use of IHH patients as controls could
   affect our study, we use a previous large-scale study profiling single
   cells from healthy human blood across 166 individuals aged 25–85
   identified the top 100 differentially expressed genes between major
   PBMC subpopulations ([167]56). Many members of the TNFRSF,
   mitochondrially encoded nicotinamide adenine dinucleotide hydrogen
   ubiquinone oxidoreductase core subunit (MT-ND), and Huntingtin
   Interacting Protein 1 (HIP1) genes were among them ([168]56). Other
   studies have also confirmed the role of Palmitoyl-protein Thioesterase
   1 (PPT1) ([169]57), regulator of chromosome condensation 1 (RCC1)
   ([170]58), and TNFRSF9 ([171]59) in a healthy immune system.
   Additionally, HIP1 receptor expression has been reported to be high in
   normal PBMCs and lymphoid tissues ([172]60). Altogether, these findings
   indirectly validate the genes identified as significant in the control
   population using the ONIDsc method.

   We also analyzed all genes present in one of the five controls for
   CD4TCs and CD8TCs ([173] Supplementary Figure 5A ). The union of these
   genes was subjected to PEA ([174] Supplementary Figure 5B ). We found
   evidence of CD4TC activation, including pathways related to
   mitochondrial activity, Vascular Endothelial Growth Factor (VEGF)
   signaling, SUMOylating in DNA damage repair, and tumor protein p53
   (TP53) acetylation, all of which are characteristic of antiviral
   responses ([175]61). We also found evidence of CD8TC activation, such
   as Rho family GTPase (RND)1–3 cell cycle pathways. RND proteins are
   involved in the regulation of the actin cytoskeleton, cell adhesion,
   and cell cycle, and may also mediate innate defense against viral and
   bacterial infections ([176]62). Finally, we did not observe any signs
   of CD4TC or CD8TC stress, nor signs of CD8TC exhaustion, as expected in
   a healthy and functional immune system.

Three SLE patient dataset: ONIDsc vs SINGE resource comparison:

   We applied SINGE and ONIDsc to the Mistry et al. ([177]7) to compare
   the computational resources required by each method. We selected this
   dataset because it was small enough to allow both methods to be
   applied, while also being biologically more similar to our target
   dataset than those used in the benchmarking section. Eight different
   immune cell types were analyzed for each patient. Definitions of the
   cell types are provided in [178]Supplementary Table 1 . When examining
   total runtime as a function of the size of each of the 24 subsets, we
   found that SINGE ([179] Figures 4A, B ) took 18 times longer to run
   than ONIDsc ([180] Figures 4C, D ; [181]Table 1 ), using the same
   computational resources (1 node, 8 CPUs, and 150 GB RAM). Moreover, our
   analysis showed an exponential increase in total runtime as dataset
   size increased linearly up to 9.8 MB for both methods ([182] Table 1 ).

Figure 4.

   [183]Four scatter plots compare total runtime versus dataset size using
   different cell types and SLE patient groups. Plots A and C show runtime
   versus size (MBs) for SINGE and ONIDsc. Plots B and D show runtime
   against the number of cells and genes. Key indicates cell types and
   patient groups with different colors and shapes.
   [184]Open in a new tab

   ONIDsc vs. SINGE: runtime and scalability comparison on the SLE patient
   dataset. (A) Total run time in hours (hrs) as a function of subset size
   in megabytes (MBs) for each subset analyzed from the Mistry and
   colleagues dataset. (B) Total run time as a function of the number (nr)
   of cells (Left) and genes (Right) for each cell type and patient. (C)
   Total run time as a function of subset size for each cell type and
   patient. (D) Total run time as a function of cells (Left) and genes
   (Right) for each cell type and patient. Eight different cell types are
   represented: MONO (green), PC (light blue), MBC (brown), CD8TC2 (blue),
   CD8TC1 (dark blue), CD8TC17 (black), CD8TREG (red) and CD4TC (orange).
   Three different patients are shown: SLE1 (asterisk), SLE2 (triangle),
   and SLE3 (circle) as indicated in the legend.

Table 1.

   Overview of methods applied to different datasets together with size in
   megabytes (MBs), gene and cell numbers, total run time in hours (hrs)
   and replica number (nr) of the biggest dataset analyzed.
   Method Lambdas Dataset Size (MBs) Gene x cell (nr) Total run time (hrs)
   Replicas (nr)
   SINGE 0,
   0.01,
   0.02,
   0.05,
   0.1 Deshpande and colleagues ([185]30) – 3025 x 3105 207.8 2
   Mistry and colleagues ([186]7) 9.8 15873 x 1022 5077 8
   ONIDsc 0.01 Nehar-Belaid and colleagues ([187]11) 18.1 22550 x 1765
   2421
   Mistry and colleagues ([188]7) 9.8 15873 x 1022 287
   [189]Open in a new tab

   To further characterize the algorithms, we explored additional factors
   that might explain the exponential increase in runtime. Specifically,
   we analyzed runtime as a function of the number of cells and genes per
   dataset independently for both methods. We found that gene count was
   more strongly correlated with total runtime than cell count for both
   SINGE ([190] Figure 4B ) and ONIDsc ([191] Figure 4D ). However, since
   all datasets in this study had more genes than cells, we cannot rule
   out a different trend in datasets where the cell count exceeds the gene
   count.

   We also assessed SINGE’s scalability. When analyzing a 9.4 MB dataset
   with 15,400 genes using the Kay cluster (1 node, 8 CPUs), certain
   hyperparameter combinations, notably lambda = 0 combined with the
   smallest time resolution and number of lags, resulted in a runtime of
   142 hours, which approached the maximum allowed by the longest-running
   cluster partition (LongQ; 144 hours). Increasing the number of CPUs to
   40 on a single node reduced runtime only slightly (to 130 hours), and
   distributing 160 CPUs across 4 nodes did not improve runtime ([192]
   Table 2 ). This indicates that using 1 node with 8 CPUs was the most
   resource-efficient configuration for SINGE, as it consumed the fewest
   core hours. It also highlights that SINGE is not scalable for datasets
   with more than 15,400 genes.

Table 2.

   Number of nodes, number of CPUs, run time (hours; hrs) and core hours
   (hrs) for a single hyperparameter combination (Lambda 0, time
   resolution 3, number of lags 5 and sigma 0.5) run with CD4TCs from a
   single SLE patient of the Mistry and colleagues ([193]7) dataset using
   SINGE.
   Number of nodes Number of CPUs Run time (hrs) Core hours (hrs)
   1               8              142            1136
   1               40             130            5200
   4               160            130            20800
   [194]Open in a new tab

   Tests were performed on Kay cluster LongQ partition.

   Altogether, this resource analysis showed that ONIDsc is significantly
   faster and more scalable than SINGE, primarily due to a reduced number
   of lambda hyperparameters tested and the exclusion of slower,
   suboptimal hyperparameter combinations.

Three SLE patient dataset: ONIDsc application

   Analyzing the regulatory networks inferred using ONIDsc, we found
   networks shared by two out of the three SLE patients for each immune
   cell type analyzed. We then examined the intersecting related genes
   across major immune compartments: innate immune cells (MONOs and LDGs),
   adaptive B cells (PCs and MBCs), adaptive T cells (CD4TCs and CD8TCs),
   and adaptive CD8TC subtypes ([195] Figures 5A, B ). We studied these
   related genes using pathway enrichment analysis (PEA) and found
   evidence of innate immune cell activation, including neutrophil
   degranulation, platelet degranulation, metal sequestration by
   antimicrobial proteins, interferon (IFN) signaling, and regulation of
   toll-like receptors (TLRs). Similarly, in adaptive B lymphocytes, we
   observed metal sequestration by antimicrobial proteins, IFN signaling,
   and TLR regulation ([196] Figure 5C ). Importantly, these results are
   consistent with previous studies reporting overactivation of
   neutrophils, MONOs, and B lymphocytes in SLE ([197]7, [198]63). We also
   found signs of CD8TC response to starvation and stress, including
   mTORC1-mediated signaling, macroautophagy, and activation of
   Ras-related C3 botulinum toxin substrate 1 (RAC1). Prior research has
   shown that autophagy is deregulated in SLE T lymphocytes ([199]64), and
   that RAC protein dysregulation can impair T lymphocyte migration
   ([200]65) and thymic development ([201]66). We did not find networks
   shared across adaptive T lymphocytes, suggesting that CD4TCs and CD8TCs
   were dissimilar among the three patients. To gain further insight into
   the role of T lymphocytes, we analyzed CD4TCs and CD8TCs independently.
   First, we identified genes common to two out of the three SLE patients
   for each T cell type ([202] Figure 6A ). We then analyzed the
   intersection of CD4TC and CD8TC genes using PEA, and found evidence of
   activation, including VEGF signaling, p53 regulation, and T cell
   receptor (TCR) and cytokine signaling ([203] Figure 6B ). We also
   observed signs of CD4TC and CD8TC response to starvation and stress,
   such as macroautophagy-related pathways, and evidence of CD8TC
   exhaustion, including programmed cell death protein 1 (PD-1) signaling.
   This contrasts with the control dataset, where no signs of CD4TC or
   CD8TC stress or CD8TC exhaustion were observed. Therefore, these
   results are consistent with previous observations that CD4TCs are
   overactivated, CD8TCs are exhausted, and autophagy is deregulated in
   SLE T lymphocytes ([204]13, [205]63). They also demonstrate the
   suitability and utility of ONIDsc for analyzing complex patient-derived
   single-cell datasets ([206]14, [207]64).

Figure 5.

   [208]Venn diagrams and network graphs illustrate gene relationships in
   immune cell types, including innate immune cells, adaptive B cells,
   adaptive T cells, and CD8+ T cells. Section A shows overlaps of related
   genes and relations, B depicts pathways with gene interactions, and C
   presents pathway enrichment with p-values ranging from 0.01 to 0.04.
   [209]Open in a new tab

   Shared regulatory networks and pathway enrichment across immune cell
   types in SLE patients. (A) Venn diagram of the number of related genes
   and relations common to innate immune cells, adaptive B cells, adaptive
   T cells, and CD8TCs for 2 out of the 3 patients analyzed in the Mistry
   et al. dataset. Cell types are colored as indicated in [210]Figure 4
   caption. (B) Network of represented by a set of related genes
   (circles), which can be regulators (blue) or targets (green), and
   relations (red arrows). The gene name is written in white inside each
   circle (white). (C) PEA showing significant pathways of the related
   genes (P value< 0.05). P-value is indicated in blue. Common networks
   significantly involved in immune-related pathways for innate and
   adaptive immune cell types were identified.

Figure 6.

   [211]A diagram and scatter plot illustrating gene distribution in T
   cells. Panel A shows a Venn diagram highlighting gene counts in CD4 and
   CD8 T cell subsets, with intersecting regions representing shared
   genes. Panel B is a scatter plot displaying the number of genes along
   various pathways, with dot colors indicating p-values. The pathways are
   listed on the x-axis, while the number of genes is on the y-axis.
   [212]Open in a new tab

   CD4TC and CD8TC gene overlap and pathway activation in SLE patients.
   (A) Venn diagram of the total number of genes (related and unrelated)
   common to CD4TCs and CD8TCs for two out of the three patients analyzed
   from the Mistry and colleagues dataset. Cell types are colored as
   indicated in [213]Figure 4 caption. (B) PEA showing significant
   pathways of all genes from intersecting CD8TCs (P value< 0.05). P-value
   is indicated in blue. CD8TC had signs of exhaustion, such as PD-1
   signaling.

Fifty-six SLE patient dataset: ONIDsc scalability

   Next, we applied ONIDsc to the dataset from Nehar-Belaid and colleagues
   ([214]10) to analyze the computational resources required by ONIDsc on
   the largest dataset identified to date in terms of both patient and
   gene numbers. Seven immune-related cell types were analyzed for each of
   the fifty-six SLE patients, resulting in a total of 392 subsets. We
   found between 2 and 1,889 cells per cell type, and between 2,997 and
   22,577 genes ([215] Supplementary Figure 6A ). For comparison,
   Deshpande and colleagues ([216]30), applied SINGE to a maximum of 3,105
   genes, and a recent benchmarking study of eleven bulk-sequencing
   regulatory network inference algorithms analyzed up to 8,000 genes
   ([217]34). In our case, the largest subset (18 MB) required 2,421 hours
   to run ([218] Table 1 ; [219]Supplementary Figure 6B ).

   To apply ONIDsc, we first explored the topology of four patient subsets
   to assess their complexity. t-SNE plots revealed circular, simple
   single-path topologies ([220] Supplementary Figures 7A–D ). We then
   searched for the optimal lambda range across all datasets, aiming to
   identify a common optimal value that would enable patient-to-patient
   comparisons in downstream analyses. We found that the minimum and
   maximum lambda values followed a severely left-skewed distribution
   ([221] Figure 7A ). The most frequent minimum lambda ranged between 0
   and 0.01 (23% of subsets), while the most frequent maximum lambda
   ranged between 0.03 and 0.05 (27% of subsets). We then examined MSE per
   gene across lambda values within the optimal range for six patient
   datasets ([222] Figure 7B ). For five of the six subsets, patient 56
   CD8TC1, patient 56 CD4TC, patient 56 CD8TREG, patient 24 CD4TC, and
   patient 6 LDG, the MSE was close to zero at the individual minimum
   lambda values and increased significantly for lambdas above 0.02 ([223]
   Figure 7C ). This suggested that, for these subsets, minimum lambdas
   were more effective than maximum values in reducing MSE. Notably,
   patient 12 CD8TC2 showed no change in MSE across all lambda values,
   both within and outside the optimal range ([224] Supplementary Figure 8
   ), indicating that smaller lambdas did not increase MSE in this case.
   Considering all simulations, we selected 0.01 as the common optimal
   lambda, as it was the most frequent and minimized MSE across subsets.

Figure 7.

   [225]Scatter and histogram plots analyze lambda values for patients.
   Panel A shows a scatter plot with maximum and minimum values labeled.
   Panel B includes two histograms showing the frequency of lambda values.
   Panel C presents multiple plots of Mean Squared Error (MSE) versus
   lambda for different patients and cell types, highlighting specific
   points with error bars.
   [226]Open in a new tab

   Optimal lambda distribution and MSE profiles across subsets in the
   56-patient SLE dataset. (A) Minimum (red) and maximum (green) optimal
   lambda values per patient and (B) distribution of minimum and maximum
   optimal lambda values for all subsets analyzed from the Nehar-Belaid
   et all dataset. The percentage of subsets with the most frequent
   minimum and maximum lambda ranges are represented in red (23% (0 -
   0.01)) and green (27% (0.03 - 0.05)), respectively. (C) MSE
   distribution over the range of optimal lambdas for six randomly sampled
   subsets: patient 56 CD8TC1, patient 56 CD4TC, patient 56 CD8TREG,
   patient 12 CD8TC2, patient 24 CD4TC and patient 6 LDG. The lambda
   distributions of these subsets are shown in [227]Supplementary Figure 8
   . The interquartile range of MSE for each of the genes in the subset is
   represented by bars. A significant difference between two data points
   is indicated with an asterisk (P value< 0.05).

   To further test ONIDsc, we analyzed relations common to four out of the
   fifty-six patients for LDGs and MBCs ([228] Supplementary Figures 9 ,
   [229]10A ). Using PEA, we found evidence of neutrophil activation
   (e.g., neutrophil degranulation, IFN signaling, and cytokine signaling)
   and MBC activation (e.g., TLR regulation) ([230] Supplementary
   Figures 9 , [231]10B ). These results were in line with previous
   observations that found neutrophils and B cells to be overactivated in
   SLE ([232]7, [233]63). We also analyzed genes shared by at least two of
   the fifty-six patients for CD4TCs and CD8TCs ([234] Figure 8A ). The
   union of these genes was analyzed using PEA ([235] Figure 8B ). We
   found signs of CD4TC activation (e.g., antigen processing, TCR
   signaling, and cytokine signaling), CD8TC activation (e.g., antigen
   processing and TP53 regulation), CD8TC response to starvation and
   stress (e.g., macroautophagy), and CD8TC exhaustion (e.g., PD-1
   signaling). These results again contrasted with the control dataset,
   where no signs of CD4TC or CD8TC stress or CD8TC exhaustion were
   observed, and were consistent with previous findings that CD4TCs are
   overactivated, CD8TCs are exhausted, and autophagy is deregulated in
   SLE T lymphocytes ([236]14).

Figure 8.

   [237]Diagram A shows gene distribution with a circle for CD4TCs
   containing 13,478 genes and a Venn diagram for CD8TCs. CD8TC1 and
   CD8TC2 share 271 genes, intersecting with CD8TREG (247) and CD8TC17
   (2710). Diagram B is a scatter plot displaying the number of genes
   versus pathways, with dot colors indicating p-values. Pathways are
   labeled along the horizontal axis.
   [238]Open in a new tab

   Common CD4TC and CD8TC genes and pathways across multiple SLE patients.
   (A) Venn diagram of the total number of genes (related and unrelated)
   common to CD4TCs and CD8TCs for two or more patients analyzed from the
   Nehar-Belaid and colleagues dataset. The number of genes present in all
   CD8TCs is shown in black. Cell types are colored as indicated in the
   [239]Figure 4 caption. (B) PEA showing significant pathways of all
   genes from intersecting CD8TCs (P value< 0.09). P-value is indicated in
   blue. CD8TCs had signs of exhaustion, such as PD-1 signaling.

Fifty-six SLE patient dataset: ONIDsc predictions

   Finally, we investigated shared gene regulatory relationships and
   disease-associated pathways across different SLE patients and immune
   cell types. We performed clustering of the cell types based on the most
   common regulatory relations among patients ([240] Figure 9A ). The
   unfiltered number of relations revealed a large and heterogeneous set
   of interactions between the different cell types ([241] Table 3 ). We
   aimed to identify the most common relationships across both patients
   and cell types. To ensure a homogeneous number of nodes per cell type,
   we selected the top 60–105 most common genes per cell type, applying a
   variable common threshold across cell types ([242] Table 3 ). The
   specific relations per cluster are provided in [243]Supplementary File
   4 . Using a variable common threshold is a strategy commonly employed
   in systems with dynamic behavior, such as time series data analysis,
   where models may switch between regimes based on lagged variables
   ([244]67) or in particle image processing, where it improves boundary
   detection, sizing accuracy, and shape parameter estimation ([245]68).
   It must be noted that while this can lead to more nuanced, flexible and
   accurate decision-making, it can also increase complexity and potential
   biases. We found that most clusters per cell type exhibited low patient
   commonality, with the most common clusters (e.g., clusters 2, 3, 5, and
   10) being present an average of seven patients (range: 3–15) across at
   least three of the five T cell types analyzed, a result expected given
   the high heterogeneity of SLE. We compared the most common SLE-specific
   relations (from clusters 2, 3, 5, and 10) with those found in control
   samples (from clusters 1, 3, and 4). Only one relation was shared
   between SLE and controls in CD4TCs: matrix remodeling-associated
   protein 8 (MXRA8) and MT-ND6. This suggests that the remaining common
   SLE relations are likely disease-specific and not part of normal immune
   function.

Figure 9.

   [246]A composite image with three sections: A) A heatmap illustrating
   relations between patients with clusters of varying sizes. B) A network
   diagram showing gene clusters with connections between genes like
   MT-ND6, CHD5, and UBXN11. C) A dot plot presenting genes and their
   associated pathways with p-values, indicating gene-pathway
   interactions.
   [247]Open in a new tab

   Clustering of shared regulatory relations across cell types and
   patients in SLE. (A) Clustering of the different cell types analyzed
   from the Nehar-Belaid and colleagues’ dataset. Each cluster is formed
   by a set of relations. The maximum number of patients with the same
   cluster is represented with colors that range from 0 (blue) to 14
   (red). The most common genes per cell type were selected using a common
   threshold ([248] Table 3 ). (B) Network of clusters 2, 3, 5 and 10
   represented by a set of related genes (circles), which can be
   regulators (blue) or targets (green), and relations (red arrows).
   Regulators can affect one (light blue) or more (dark blue) targets. The
   name of the gene is written in white inside each circle (white). (C)
   PEA showing significant pathways of the network genes from clusters 2,
   3, 5 and 10 (P value< 0.05). The p-value is indicated in blue.

Table 3.

   Overview of most common networks per cell type.
              Network features        TH    TREG  TC1    TC17     MBC    TC2 LDG
         Unfiltered Nr of Relations 4310973 57583 1416 13207877 19933317
30454033          16285128
         Common Variable Threshold     6      2    1      4        4      4   4
                   Genes              62     60    60     78      105    76  65
                 Relations            71     72    57     68       97    66  55
   [249]Open in a new tab

   The unfiltered number of relations shows there is a large number of
   relations and a significant cell type heterogeneity.

   We selected the most common relations across cell types ranging between
   60-105. This was done using a common variable threshold defined as the
   minimum number of patients, out of the total 56 SLE patients analyzed,
   that had a given relation.

   We performed PEA on all related nodes from the networks forming the
   most common clusters ([250] Figure 9B, C ). Six out of twenty-six genes
   were significantly associated with known pathways. MT-ND6, present in
   all clusters and cell types, was significantly linked to aerobic
   respiration pathways, including the TCA cycle and electron transport.
   Previous studies have shown that reduced MT-ND6 expression in SLE
   patients is associated with increased inflammatory CD4TC and CD8TREG
   death ([251]69, [252]70). However, we also found MT-ND6 in all control
   cell types, suggesting its presence in SLE may reflect normal immune
   function rather than disease-specific activity. Solute carrier family
   66 member 1 (SLC66A1), found in cluster 5 and present in CD4TC, CD8TC1,
   CD8TREG, and LDGs, is related to lysosomal transport. Encouragingly,
   previous studies have shown lysosomal transport dysfunction in
   phagocytes and B cells in SLE patients ([253]71). Ultrabithorax Domain
   Protein 11 (UBXN11), found in clusters 2 and 10, was present in all
   cell types ([254] Supplementary Figures 9 –[255] 15 ) and is associated
   with the RND1–3 cycle in LDGs, CD8TREGs, and CD8TC1s. nicotinamide
   adenine dinucleotide kinase (NADK), RNA Polymerase III Subunit GL
   (POLR3GL), MXRA8, found in cluster 5, were present in CD4TC and CD8TREG
   and are linked to nicotinate metabolism, RNA transcription, and protein
   phosphorylation, respectively. Importantly, to the best of our
   knowledge, none of these four genes have been previously implicated in
   the pathophysiology of SLE. However, their functions are of potential
   relevance as they are involved in mechanisms deregulated in SLE. NADK
   is critical for redox balance in immune cells metabolism, POLR3GL may
   influence innate immune sensing of cytosolic DNA, a pathway known to be
   hyperactive in SLE, and MXRA8, a matrix remodeling protein, has been
   linked to viral entry and immune signaling ([256]72). NADK, POLR3GL,
   and UBXN11 were not detected in the control datasets. Given their
   absence in controls and potential functional relevance, and the limited
   prior research on their roles in SLE, we propose MXRA8, NADK, POLR3GL,
   and UBXN11 as candidate genes for further investigation in the context
   of systemic lupus erythematosus.

Discussion

   We developed a regulatory network reconstruction algorithm, ONIDsc,
   that enhances GLG causality to identify molecular disease networks in
   SLE. This enhancement was achieved by applying cyclical coordinate
   descent to optimize the lambda penalty, improving model performance.
   GLG causality is based on SINGE, an algorithm well-suited for building
   networks from pseudotemporally ordered single-cell datasets with
   irregular time series. While SINGE includes subsampling, zero handling,
   and aggregation to improve robustness, it uses a limited set of
   predefined hyperparameter combinations, which can reduce performance,
   increase runtime, and limit scalability. Comparison with other network
   inference methods shows that ONIDsc consistently outperformed SINGE
   and, when assessing overall performance, ONIDsc also outperformed all
   the other network inference methods used to benchmark our algorithm for
   the ESC to endoderm differentiation dataset. This dataset’s gold
   standards were primarily derived from ChIP-chip and ChIP-seq data.
   Prior studies have shown that networks inferred using curated TF-gene
   interactions from ChIP-based experiments are more accurate than those
   based solely on gene expression data ([257]73, [258]74). This may be
   because ChIP-derived interactions are direct (causal), while
   expression-based interactions can be indirect (correlational). In
   contrast, SINCERITIES, JUMP3, and GENIE3, which are optimized for gene
   expression data, outperformed ONIDsc on the retinoic acid dataset,
   where 48% of interactions came from LoF and GoF data.

   ONIDsc is applicable across datasets when prior knowledge of cell type
   markers and process-driving genes is available. We used a supervised
   approach to improve accuracy and reliability, though this requires
   labeled data, which may be difficult to obtain. In such cases,
   unsupervised methods may be more appropriate, though they are harder to
   evaluate and more computationally intensive, especially for large
   datasets ([259]75). We applied ONIDsc’s pseudotime inference to
   benchmarking datasets and found it performed significantly better than
   Monocle in early metrics when the dataset topology was linear or
   circular. However, pseudotime inference becomes more challenging in
   tree-like or bifurcated topologies ([260]76, [261]77). While ONIDsc’s
   pseudotime method was suitable for the datasets studied, more complex
   topologies may require alternative approaches.

   ONIDsc achieved higher precision, lower MSE, and faster runtimes than
   SINGE by selecting optimal and efficient hyperparameter combinations. A
   key advantage is its ability to scale up for larger patient datasets,
   which was not feasible with SINGE. Resource analysis showed that
   runtime correlated more strongly with gene count than cell count for
   both methods. However, since all subsets had more genes than cells, we
   cannot exclude different trends in datasets with more cells than genes.
   McCalla et al. ([262]34) studied the effect of gene numbers (10 to 8000
   genes) over run time for various network inference algorithms and
   similarly found an exponential increase. However, all datasets in their
   study had the same number of cells (5520). Therefore, the effect of
   different cell numbers on algorithm run time was not assessed.
   Interestingly, for datasets with more cells than genes, the exponential
   increase in run time when increasing genes linearly was conserved.

   Applying ONIDsc to two SLE datasets demonstrated its ability to
   accurately reconstruct regulatory networks at scale (>15,000 genes)—a
   task not possible with SINGE. In both datasets, ONIDsc identified genes
   and pathways related to innate and adaptive B cell activation,
   consistent with previous findings of neutrophil, MONO, and B cell
   overactivation in SLE ([263]7, [264]63). These cell types are involved
   in autoantibody production, antigen presentation, and cytokine
   secretion, all of which are dysregulated in SLE. In particular, TLR
   regulation is critical for immune tolerance, and its disruption can
   lead to autoimmunity ([265]78). We also found signs of CD4TC
   activation, which contributes to autoantibody production and
   inflammation and is linked to SLE pathogenesis ([266]14). Our data
   shows that CD8TCs were activated and exhausted due to the presence of
   the PD-1 signaling pathway. PD-1 is an inhibitory receptor expressed in
   response to continuous TCR stimulation without co-stimulatory
   molecules. In SLE, elevated PD-1 expression may impair CD8TC
   cytotoxicity, increasing infection risk and potentially triggering
   autoimmunity ([267]13, [268]14). Finally, PD-1 is essential in
   maintaining CD8TC immune tolerance to tissue antigens by inhibiting T
   cell effector differentiation ([269]79). Aerobic respiration pathways
   were enriched in CD8TCs and MBCs due to MT-ND6 expression. Reduced
   MT-ND6 levels in SLE have been linked to mitochondrial dysfunction,
   increased ROS, and ATP deficiency, promoting inflammatory T cell death
   ([270]69, [271]70). Interestingly, we also found cell death pathways in
   CD8TREGs ([272] Supplementary Figure 12B ), suggesting that aerobic
   respiration may be relevant across all CD8TCs and MBCs. Moreover, the
   lysosomal transport pathway was associated with LDGs, CD4TCs, and most
   CD8TC subsets due to the presence of SLC66A1. Studies have shown this
   pathway to be dysfunctional in phagocytes and B cells in SLE patients
   ([273]71, [274]80). Our findings suggest that lysosomal transport may
   also play a role in T cell dysregulation in SLE. The Rho GTPase
   signaling pathway, which regulates actin cytoskeleton dynamics and cell
   proliferation ([275]81), was found to be significant in LDGs and some
   CD8TC subsets due to the presence of UBXN11. While RhoA GTPase has been
   proposed as a therapeutic target in SLE ([276]82), UBXN11 has not
   previously been linked to the disease.

   ONIDsc identified four genes, NADK, POLR3GL, MXRA8 and UBXN11, present
   in CD4TCs, CD8TREGs, CD8TC1s, and LDGs in SLE patients but absent in
   controls. The selected clusters containing these genes were shared by
   an average of seven patients, which is consistent with the high
   heterogeneity of SLE. Remarkably, these genes have not been shown
   before to be involved in SLE, but there is evidence that supports a
   role in the disease. These findings have also improved the robustness
   and generalizability of our method findings. Thus, these genes were
   associated with the nicotinate metabolism, RNA transcription, protein
   phosphorylation, and Rho GTPase signaling, respectively. A study of the
   literature strongly supports the role of these proteins in SLE, as they
   are involved in physiological functions commonly deregulated in this
   disease. The nicotinate metabolism pathway has been implicated in SLE
   and other inflammatory diseases ([277]83). Interestingly, Serum
   metabolomic studies have shown that abnormalities in related pathways,
   such as nucleoside metabolism, may reflect immune cell dysfunction,
   particularly in energy-demanding processes like activation and
   proliferation ([278]23). MXRA8, a cell surface adhesion receptor, is
   involved in extracellular matrix remodeling, endothelial and epithelial
   interactions, and viral entry ([279]84). Dysregulation of MXRA8 may
   increase vascular permeability and immune cell infiltration,
   contributing to tissue damage in SLE ([280]85). Finally, UBXN11,
   expressed in multiple immune cell types, belongs to a protein family
   known to regulate NF-κB and type I IFN pathways, both of which are
   central to SLE pathogenesis ([281]86–[282]88). Further experimental
   validation is needed to confirm the involvement of these genes in SLE.
   Nonetheless, our findings demonstrate that ONIDsc is a versatile and
   scalable tool for identifying physiological and pathological mechanisms
   in immune cells from complex single-cell datasets.

Limitations of our study

   Apart from the kernel sparsity, ONIDsc depends on three other
   hyperparameters that control the kernel smoothness and the window of
   past expression of candidate regulators. Currently, a reduced set of
   specific hyperparameter values and pairs are considered instead of all
   possible combinations. Thus, the algorithm could be further optimized
   by using grid search approaches to find optimal kernel hyperparameters.
   Nevertheless, it is not clear that this would yield significant
   performance improvements ([283]30). Another way to improve ONIDsc
   algorithm could be by integrating different pseudotime-based methods.
   This could result in higher precision and certainty that the order
   really represents the process of study. ONIDsc current cell ordering
   method assumes linearity in the cell topology by ordering cells on a
   single dimension based on the normalized sum to the expression values
   of a set of predefined features, which are the main drivers of the
   process of study. Nevertheless, ONIDsc has the flexibility to
   incorporate different topologies and even combining real-time points
   with pseudotime inference methods. Other pseudotime algorithms, like
   tools for single cell analysis (TSCAN) ([284]89), are better suited for
   linear topologies, while Monocle ([285]28) or cell lineage and
   pseudotime inference for single-cell transcriptomics (Slingshot)
   ([286]90) are better suited for tree-like topologies and selective
   locally linear inference of cellular expression relationships (SLICER)
   ([287]91) is better suited for circular trajectories. Furthermore, cell
   trajectory inference using time-series single-cell RNA sequencing data
   (Tempora), has been developed to combine real-time points with
   pseudotime ([288]92). Due to the different assumptions of the
   algorithm’s commonalities are not to be expected. Thus, the adequacy of
   the different cell ordering methods could be evaluated through
   benchmarking ([289]93). Data exploration methods could be developed to
   assess which method is better suited for on a given dataset.

Acknowledgments