Abstract Single cells are typically typed by clustering into discrete locations in reduced dimensional transcriptome space. Here we introduce Stator, a data-driven method that identifies cell (sub)types and states without relying on cells’ local proximity in transcriptome space. Stator labels the same single cell multiply, not just by type and subtype, but also by state such as activation, maturity or cell cycle sub-phase, through deriving higher-order gene expression dependencies from a sparse gene-by-cell expression matrix. Stator’s finer resolution is clear from analyses of mouse embryonic brain, and human healthy or diseased liver. Rather than only coarse-scale labels of cell type, Stator further resolves cell types into subtypes, and these subtypes into stages of maturity and/or cell cycle phases, and yet further into portions of these phases. Among cryptically homogeneous embryonic cells, for example, Stator finds 34 distinct radial glia states whose gene expression forecasts their future GABAergic or glutamatergic neuronal fate. Further, Stator’s fine resolution of liver cancer states reveals expression programmes that predict patient survival. We provide Stator as a Nextflow pipeline and Shiny App. Keywords: Higher-order Gene Expression Dependencies, Single-cell Transcriptomics, Structure Learning, Cell State, Cell Cycle Phases Subject terms: Chromatin, Transcription & Genomics; Computational Biology; Methods & Resources Synopsis graphic file with name 44320_2024_74_Figa_HTML.jpg Stator assigns multiple states and (sub)types to a single cell using coordinated gene expression and/or non-expression in a data-driven manner. Its application of structure learning has increased power for quantifying higher-order expression dependencies. * Stator does not require PCA or UMAP dimensionality reduction, and its states are not based on cell proximity in expression space. * Cells can be assigned by Stator simultaneously to cell type, subtype and state (e.g., activation, maturity and cell cycle sub-phase). * Stator states are inferred with uncertainty quantification and are reproducible in disjoint data sets. __________________________________________________________________ Stator assigns multiple states and (sub)types to a single cell using coordinated gene expression and/or non-expression in a data-driven manner. Its application of structure learning has increased power for quantifying higher-order expression dependencies. graphic file with name 44320_2024_74_Figb_HTML.jpg Introduction To attribute disease to cell type, and molecular features of cells to disease state, we need to define and distinguish cell types, subtypes and states (Dann et al, [37]2023). We follow others (Fleck et al, [38]2023; Morris et al, [39]2019; Wagner et al, [40]2016) by distinguishing cell types—with their more permanent phenotypic features—from cell states, whose features are transient and can be elicited by stimulus; cell subtypes are sub-populations of the same cell type that share distinctive features. The Human Cell Atlas (Regev et al, [41]2017) has taken a step in this direction by seeking definition of all human cell types and their molecular features, most often gene expression, within a multidimensional ‘cell space’ (Regev et al, [42]2017). Typing of cells is easiest when their lineages are well separated, and hardest when they are distinguished only by state (such as cell cycle phase, level of maturity or response to stimulus) or spatial location. Two stages of dimensionality reduction are commonly used in scRNA-seq analysis pipelines. The first is the projection of cells into e.g., [MATH: K≤50 :MATH] dimensions (or e.g., [MATH: K :MATH] determined via a scree plot for Principal Component Analysis (PCA)) using methods that, whilst not fully preserving Euclidean distances in lower dimensions, produce an embedding whose distances are still quantitatively meaningful. This includes PCA, which features in Scanpy (Wolf et al, [43]2018) and Seurat’s (Stuart et al, [44]2019) default pipelines. This dimensionality reduction is done to reduce noise and to avoid the curse of dimensionality in downstream analyses that rely on the quantification of Euclidean distances, such as clustering, which would otherwise break down in high dimensions (Aggarwal et al, [45]2001). Clustering is often used to define a cell type as a collection of cells that group more closely in gene expression space than other cells. This approach has yielded cell-type definitions at relatively low resolution, but requires additional analyses to begin resolving states within continuous trajectories of cell-state change (Dann et al, [46]2022; Kotliar et al, [47]2019; Ponting, [48]2019). The second stage of dimensionality reduction, often used in scRNA-seq analysis pipelines to date, further reduces the K-dimensional space to 2 or 3 dimensions, often using t-distributed Stochastic Neighbour Embedding (tSNE) (van der Maaten and Hinton, [49]2008) or Uniform Manifold Approximation and Projection (UMAP) (McInnes et al, [50]2020), for qualitative and exploratory analysis through visual inspection. Due to the lack of guarantees on distance preservation (for tSNE and UMAP), such extreme dimensionality reduction (even for PCA) inevitably results in significant distortions (Chari and Pachter, [51]2023; Cooley et al, [52]2022). Cells adopt a continuum of states, representing cellular activities such as the cell cycle or responses to stimuli (Kotliar et al, [53]2019; Xia and Yanai, [54]2019). Labelling cells only by type thus does not finely resolve their dynamic behaviour such as during development or disease (Morris, [55]2019). Cell states are currently predicted by PCA (Shalek et al, [56]2014; Steuerman et al, [57]2018), Independent Component Analysis (ICA) or Non-Negative Matrix Factorisation (NMF) (Puram et al, [58]2017; Saunders et al, [59]2018). However, components or factors inferred by these algorithms may not faithfully or finely resolve cellular processes. States previously predicted by NMF among cancer cells, for example, include non-specific descriptors, such as ‘stress’, ‘metal response’ and ‘basal’ (Barkley et al, [60]2022). Automatic cell annotation methods, such as CellTypist (Conde et al, [61]2022) and foundation models such as (Cui et al, [62]2023), are not intended to identify cell (sub)types de novo in a data-driven manner. This is because training of these models requires pre-existing cell-type annotations. Such cell-type annotations are curated, and so are susceptible to subjective bias (see 'Results'). These methods are also not intended to annotate states present across diverse cell (sub)types. To identify both cell (sub)types and states at high resolution, we introduce Stator, which eschews cell clustering and instead defines states using the coordinated expression and non-expression of genes in single cells. Higher resolution is achieved by taking advantage of expression interactions at higher than pairwise ( [MATH: 3≤n≤7 :MATH] ). Expression interactions that commonly co-occur in cells are gathered together as a single state label. The method yields biologically compelling labels of type, subtype and state for cells in healthy and disease contexts without invoking concepts of expression space or pseudotime. These states are neither necessarily proximal in gene expression space nor necessarily categorical, thereby capturing the continuous nature of cell states and, in some cases, previously defined types. As with all cell state or (sub)type markers, Stator labels do not necessarily imply molecular mechanism. Rather, Stator reveals molecular and cellular heterogeneity and dynamics that would otherwise have been overlooked but can now be investigated experimentally. We show how Stator predicts, in a data-driven manner, sub-phases of the cell cycle, capturing transcriptional dynamics across each cell cycle phase, the future neuronal (sub)type fate of immature cell precursors, rare disease-associated human endothelial cell subtypes and cycling transformed hepatocytes whose expressed genes are predictive of liver cancer survival. Results Stator’s high-level workflow is illustrated in Fig. [63]1 with each step detailed in 'Methods'. Briefly, after performing standard Quality Control (QC) (Luecken and Theis, [64]2019), including doublet removal (Wolock et al, [65]2019), it initially restricts consideration to the most highly variable genes (HVG; often [MATH: N=1000 :MATH] ) (Wolf et al, [66]2018) followed by binarisation of gene expression. Binarisation does not substantially alter conclusions when analysing sparse data (Bouland et al, [67]2021, [68]2023; Qiu, [69]2020). Input to Stator is a cell [MATH: (M) :MATH] by binary gene expression ( [MATH: G :MATH] ) matrix (Fig. [70]1A). The model-free estimator of higher-order interactions (MFI) we introduced in (Beentjes and Khamseh, [71]2020) then estimates [MATH: n :MATH] -point interactions among [MATH: n=2,3,…< mo>,7 :MATH] genes (Fig. [72]1B). Comparison between this estimator of dependence and other estimators such as correlation and mutual information is presented in Fig. [73]EV1 and Appendix Fig. S[74]1 (Jansma, [75]2023a). In the next step (Fig. [76]1C), “d-tuples” are extracted. “d-tuples” are defined as gene tuples that significantly drive interactions ('Methods'). This step is achieved by comparing the expression of each tuple of genes in the MFI estimator to their expression under the null distribution of independence (see 'Methods' for full details). Figure 1. Workflow of Stator. [77]Figure 1 [78]Open in a new tab Steps (A–E) are fully data-driven; steps (F, G) require biological interpretation. (A) A quality controlled, cell ( [MATH: M :MATH] )-by-binarised gene expression ( [MATH: G :MATH] ) matrix is used as input. (B) [MATH: n :MATH] -point model-free interactions (MFI) are estimated ( [MATH: n=2,…,7 :MATH] ) from the graph of conditional dependencies among the genes. Green edges denote positive values, red edges denote negative values, the larger green triangle represents a positive 3-point dependence. Prior to this, the graph is inferred with an MCMC graph-optimisation algorithm on an initial structure obtained by the Peter–Clark causal discovery algorithm. The graph itself is not used to claim causation, rather to improve the statistical power of detecting [MATH: n :MATH] -point interactions among genes with strong inter-dependencies. (C) Tuples that are significantly deviating (default: FDR [MATH: < :MATH] 0.05, [MATH: log2 (FC)>3 :MATH] ) as compared to the null hypothesis of independence (interaction = 0) are extracted. These gene combinations are deviating tuples, or “d-tuples”. The significant tuple in this example is [MATH: (G1,G2,G3< /msub>)=(1,1,1) :MATH] but d-tuples containing zero-values representing unexpressed genes are also found. Red and blue represent even and odd numbers of expressed genes (equal to 1) in the MFI. (D) A binary cell ( [MATH: M :MATH] )-by-d-tuple ( [MATH: K :MATH] ) matrix is created. Entries with 1 indicate a cell containing a significant given tuple, in this example cells with [MATH: (G1,G2,G3< /msub>)=(1,1,1) :MATH] . Entries with a zero represent cells not containing the d-tuple. The matrix is created using all [MATH: K :MATH] significant interactions and corresponding d-tuples. (E) Hierarchical clustering of d-tuples (rather than cells) is performed to group any d-tuples that co-occur unusually often in single cells. The dendrogram is cut, by default, at a Dice similarity that maximises the modularity score (Newman, [79]2006), but is adjustable. This procedure results in groups of d-tuples that can contain both the presence and absence of a gene’s expression. (F) At this stage, the user annotates and interprets the groups of d-tuple genes to infer cell states. Unlike clustering of cells, this procedure can result in cells that exist in multiple biological states simultaneously. (G) A Shiny App in R enables the user to compare Stator states against external annotations, such as other data-driven or expert annotations. Left: The horizontal box represents the significant enrichment of several Stator states in cells with a specific externally annotated cell type, demonstrating the existence of multiple cell subtypes that could be explored further. The vertical box represents Stator states spanning multiple externally annotated cell types, representing non-cell-type restricted biological states, e.g., cell cycle phases. The user can also choose to compare Stator state enrichment against biological conditions of an experiment. Stator’s Shiny App allows further integrative analyses, such as differential expression of Stator states or Gene Ontology term enrichment (Aleksander et al, [80]2023; Ashburner et al, [81]2000; Sayols, [82]2023; Wu et al, [83]2021; Yu et al, [84]2012). We provide a step-by-step guidance for labelling Stator states in 'Methods', section “Assigning labels to Stator States”. Figure EV1. Expanded view for Fig. [85]1. [86]Figure EV1 [87]Open in a new tab Comparison of MFIs with other estimators of dependence. Different causal dynamics lead to different association metrics, and only MFIs can distinguish all 6 scenarios and reveal the combinatorial effect of a multiplicative interaction. Green edges denote positive values, red edges denote negative values, circles denote a 3-point quantity, and dashed lines show edges that show marginal significance that depends on the level of simulated noise. Correlations and mutual information cannot distinguish between most dynamics, and while partial correlation can, for certain noise levels, identify the correct pairwise relationships, it falls short of distinguishing additive from multiplicative dynamics. See Appendix Fig. S[88]1 for the simulation parameters and precise values. Reproduced from (Jansma, [89]2023a) with permission from the author. Next, a new matrix of cell ( [MATH: M :MATH] )-by-binary d-tuple ( [MATH: K :MATH] ) is created (Fig. [90]1D). Entries with 1 in the matrix indicate cells with that particular d-tuple gene expression combination; entries with 0 do not contain that given gene expression combination. Stator next performs hierarchical clustering of gene d-tuples based on these d-tuples’ co-occurrence in single cells (Fig. [91]1E). Absence of expression for a gene is denoted by a minus, e.g., [MATH: G1− :MATH] and [MATH: G2− :MATH] on the rightmost branch in Fig. [92]1E. Crucially, this clustering takes place in a restricted space of d-tuples. Consequently, rather than a cell being placed at a single location in gene expression space, as is usual in scRNA-seq analyses, Stator allows for cells to adopt multiple biological states (Fig. [93]1F). We show below that a single cell can be thrice (or more) labelled (Appendix Fig. S[94]2), for example as a radial glial-like precursor cell, as an astrocyte progenitor and a cell in G2/M cell cycle phases. Once groups of combinatorial gene signatures are identified, users can tune the modularity parameter that varies the granularity at which Stator states are resolved. Stator’s memory and run time are discussed in 'Methods' and Appendix, Section A.1, Appendix Figs. S[95]3 and S[96]4. Stator states are definable not just by d-tuple genes but also by other genes that are significantly differentially expressed relative to all other states or one other state (Fig. [97]1G): these are state-to-other DEGs (s2o-DEGs, adjusted p value [MATH: <0.05 :MATH] , [MATH: ∣log2 (FC)∣>0.25 :MATH] ) and state-to-state differentially expressed genes (s2s-DEGs, adjusted p value [MATH: <0.05 :MATH] , [MATH: ∣log2 (FC)∣>0.25 :MATH] ), respectively (Fig. [98]1G). For this analysis, the unbinarised expression values of all genes (not only the [MATH: 1000 :MATH] HVG) are considered using existing methods, e.g., (Stuart et al, [99]2019). The app further permits Stator states to be queried for their enrichment in previously derived annotations, such as experimental condition (healthy versus disease, different time points), or cell (sub)type or biological state labels. To demonstrate Stator’s ability to identify cell states and (sub)types, we investigated three published scRNA-seq datasets in normal and disease contexts, in embryological or adult tissue, and in human or mouse. This first set contains astrocyte and neuron progenitors from mouse late embryonic (E18) brain (10XGenomics, [100]2017), chosen because this is the developmental stage when astrocytogenesis occurs and when cortical radial glial precursors (RPs) asymmetrically divide to generate neurons in the developing mouse cortex (Akdemir et al, [101]2020; Rubenstein et al, [102]2020). The second is scRNA-seq data from human liver cells from disease (cirrhosis) and control donors (Ramachandran et al, [103]2019). Thirdly, we applied Stator to human liver cancer (hepatocellular carcinoma) cells (Barkley et al, [104]2022). Biological validity of a Stator state is provided when its d-tuple genes, s2o-DEGs and/or s2s-DEGs occur in a common cellular process and/or marker gene set; for further details see 'Methods' “Assigning labels to Stator states”. We start by showing how Stator identifies cells present in only a portion of a cell cycle phase before then revealing cell subtypes and states that had hitherto not been inferred from these datasets. For each dataset, we compare Stator with NMF (Barkley et al, [105]2022; Gaujoux and Seoighe, [106]2010). We further compare Stator’s output on the RPs’ dataset with (i) clustering with pairwise significance quantification (Gao et al, [107]2022; Stuart et al, [108]2019) (Fig. [109]2A; Appendix Fig. S[110]6), (ii) LDVAE (deep learning-based) (Gayoso et al, [111]2022; Svensson et al, [112]2020), (iii) LDA (topic modelling) (Blei et al, [113]2003; Gayoso et al, [114]2022; Srivastava and Sutton, [115]2017), and (iv) its cell cycle states with Tricycle (Zheng et al, [116]2022b). Analyses (ii)–(iv) are presented in 'Methods' “Comparison with other methods”. Figure 2. Stator identifies cell states in seemingly homogeneous mouse embryo radial glial cell-like precursor cells. [117]Figure 2 [118]Open in a new tab Stator identifies 25 signatures at maximum modularity. We have labelled 23 of 25 Stator states by performing differential gene expression analysis between cells in one state and all other cells, followed by gene enrichment (GO/KEGG) analyses. The significant differentially expressed genes were also compared with known gene markers of cell types and states. In such cases, we required at least three marker genes to be highly expressed. (A) Barplot colours indicate the proportion of cells captured by each cluster following hierarchical clustering of cells (see Appendix Fig. S[119]6) that resulted in two significantly different clusters only. Right-hand side: three exemplar Stator states (#2, 8, 11) are highlighted in a PCA embedding of the unbinarised expression data, and annotated with the number of cells they label, the d-tuples from which they are defined, and their five most common d-tuple genes. (B) Numbers of cells labelled with any one of 7 cell cycle states (# 3, 4, 6, 7, 8, 9 and 11); areas of circles are proportional to their number (see legend). Filled circles indicate numbers of cells labelled with only one of these single cell cycle states. Grey circles’ areas indicate numbers of cells labelled with two cell cycle states, those indicated by lines. Numbers of significantly differentially expressed genes between cell cycle state pairs (i.e., s2s-DEGs) are provided between the two states being compared; their colours refer to the state showing higher expression. For clarity, state pairs with [MATH: ≥ :MATH] 25 cells are shown. DEGs between any two states, including state pairs with fewer than 25 co-labelled cells, are provided in Dataset EV[120]5. Appendix Fig. S[121]8A additionally provides the number of co-labelled cells between any two states. Right: s2s-DEGs are indicated by “ [MATH: > :MATH] “ or “ [MATH: < :MATH] “ symbols; for example, Hells mRNA expression is significantly higher in State # 4 over States # 11, 6, 9, 8 and 7. Early/late G1/S or G2/M cell cycle phase labels (top) were assigned using these mRNAs’ cell cycle phases known from high-throughput (top right; (Giotti et al, [122]2018)) and targeted experiments (Ung mRNA in late G1/S (Slupphaug et al, [123]1991) and Cenpa in G2 (Shelby et al, [124]1997)). (C) Heatmap of expression level (z-score) for genes upregulated in state #18, versus other states, for cells in State #18 and a random selection of cells from other groups ( [MATH: n=1500 :MATH] ). Z-scores are computed on a gene-by-gene basis by subtracting the mean and then dividing by the standard deviation throughout this study. Genes were ordered by hierarchical clustering. Upregulated genes are significantly involved in Cilium assembly ([125]GO:0060271; [MATH: q=3×10−11 :MATH] ). (D) Heatmap of expression level (z-score) for genes upregulated in state #22, versus other states, for cells in state #22 and a random selection of cells from other groups (n = 500). Genes were ordered by hierarchical clustering. Upregulated genes reveal state of metaphase/anaphase. (E) Dot plot illustrating differential expression of astrocytogenesis marker genes across all Stator states. The size of the dots represents the - [MATH: log10(Seuratp-val-adj) :MATH] from differential expression testing between a state and all other states. Colour intensity represents the [MATH: log2 (FC) :MATH] of gene expression. Stator identifies states in seemingly homogeneous cells We first applied Stator to 11,950 E18 mouse brain cells (Methods 'Datasets'). These highly express canonical markers (e.g., Slc1a3, Mt3 and Mfge8 (Yuzwa et al, [126]2017)) of embryonic radial glial precursors (RPs), which later develop into astrocytes or neurons via intermediate progenitor (IP) cells. Upon clustering, these cells appear to be highly homogeneous without being separable into, for example, cells in cell cycle phases. Specifically, they could not be discriminated, using hierarchical clustering with significance quantification of clusters (Gao et al, [127]2022) on the original (unbinarised) expression space, beyond two significantly different ( [MATH: p< :MATH] 0.05 Bonferroni corrected, 'Methods') and robust clusters of 6485 and 5465 cells, respectively (Appendix Fig. S[128]6). By contrast, Stator predicted 25 States (Fig. [129]2A; Datasets EV[130]1 and EV[131]2), with the optimal Dice dissimilarity of 0.95. The majority (75.7%; [MATH: N=9044 :MATH] ) of cells occupy one or more state, and 34.7% (4151) of cells are unique to a single state (Appendix Fig. S[132]2A,[133]B). Some states (e.g., #2) are localised in a PCA embedding, but most are not (e.g., #8 and #11). d-tuples in 7 states contained known cell cycle marker genes (Fischer et al, [134]2016; Tirosh et al, [135]2016): in the largest (#11; 2168 cells), nearly all d-tuples contained one or more known G1/S-phase markers’ genes (33 of 36 d-tuples; 92%): 23 d-tuples contained either Gins2 or Gmnn or both (20, 14 or 11 d-tuples, respectively). For illustration, one d-tuple contains three known S-phase expressed genes (Dnmt1, Hells, Pcna; (Giotti et al, [136]2018)) with their coordinated expression (i.e., 1-values) in 258 cells, which corresponds to [MATH: >6.5 :MATH] -fold deviation (default is set at eightfold) from the null hypothesis of independent expression (FDR [MATH: <0.01 :MATH] , default is set at 0.05); 35 other d-tuples co-occur sufficiently with this d-tuple in these cells to be combined into this single Stator state (#11). To assess biological validity of these Stator predictions—whether they might indicate cell types, subtypes or states—we undertook differential gene expression analysis (Datasets EV[137]3, EV[138]4, and EV[139]5). The 7 states’ s2o-DEGs were predominantly cell cycle markers. Specifically, of 45 genes that were among the 10 most significantly differentially expressed in 1 or more of the set of 7 s2o-DEGs, 40 (89%) were G1/S or G2/M stage marker genes according to (Tirosh et al, [140]2016) (Table S[141]5) and (Fischer et al, [142]2016) (Table S[143]10), confirming them as cell cycle states. Many s2s-DEGs were also cell cycle marker genes (Fig. [144]2B). Pairs of Stator states with s2s-DEGs are transcriptomically non-identical, even if they show some transcriptomic similarity, as expected for states located along a continuum. Note that pairs of states are concluded to be transcriptomically non-identical when they have significant s2s-DEGs (beyond the Stator state-defining d-tuple genes). In the second largest prediction (state #9; 2145 cells), all 34 d-tuples contained G2/M phase marker genes (Fischer et al, [145]2016; Giotti et al, [146]2018; Tirosh et al, [147]2016): 23 contained Pbk, 17 contained Cenpa, and 9 contained both. Stator predicted these states as G1/S (#11) and G2/M phases (#9), respectively, by their cells’ transcriptomes differing by 429 and 249 s2s-DEGs Fig. [148]2B) including for #11: G1/S phases’ marker genes (e.g., Tuba1b, Rpa2, Mcm4, Tipin, Mcm2, Hat1, Rfc3 and Rfc2); and, for #9: G2/M phases’ marker genes (e.g., H2afv, Arl6ip1, Stmn1, Ccdc34, Tacc3, Racgap1, Hmgb3, Calm3, and Cenpe), all genes that did not contribute to d-tuple definition. As expected, cells co-labelled with both states #9 and #11 preferentially expressed G1/S marker genes (Pclaf, Mcm6, Gins2 and Gmnn, for example) or G2/M markers (Pbk, Cenpa, Ccnb2 and Cdca3, for example) compared with cells only labelled with state #9 or with #11, respectively. Stator thus not only identifies cells that are cycling, but further differentiates cells into G1/S versus G2/M cell cycle phases. This justified labelling our method’s predictions as “Stator states”. Applying the same approach (comparing Stator states’ expressed d-tuple genes and s2s-DEGs with known cell cycle phase marker genes) identified five less-populated states (#3-4, #6–8) as additional cell cycle states, each transcriptionally non-equivalent with respect to states #11 (G1/S phases) and/or #9 (G2/M phases) and to each other, Fig. [149]2B. These five states’ s2s-DEGs again included marker genes for G1/S phases (states #3-4) or G2/M phases (states #6–8) relative to #11 (G1/S) and/or #9 (G2/M). In particular, 2 G1/S cell cycle phases’ marker genes (Hells, Mcm5) are significantly more highly expressed in cells in state #4 over #11, and indeed in states #3, 6, 9, 7 and 8; similarly, Gins2 has higher expression in #11 than in #3, 6, 9 and 8, Fig. [150]2B. Demanding that at least 3 s2o-DEGs are known markers of an annotation (Datasets EV[151]2–[152]4), we labelled the other states as either Intermediate progenitor cells (IPC) (Ruan et al, [153]2021), radial glial cell-like cells (RGC-like) (Zheng et al, [154]2022a), or astrocyte progenitor cells (APC) (Liu et al, [155]2022); or in the metaphase/anaphase of the cell cycle (significant enrichment of [156]GO:0045841 (Ashburner et al, [157]2000), FDR [MATH: <0.05 :MATH] ) or apoptosing or activated cells (expressing mitochondrial genome genes or intermediate early genes or activation markers (Lacar et al, [158]2016), respectively); or blood cell contaminants that highly expressed not just globin genes (Biagioli et al, [159]2009) but also Alas2, an erythroid-specific gene (Fig. [160]2). More specifically, from differential expression of s2s-DEGs Sparc and Sparcl1 (Dataset EV[161]6), states #12 and #13 appear to label APC1 and APC2, two known astrocyte progenitor cell types (Liu et al, [162]2022), and state #21 is associated with higher expression of truncated radial glial cell markers (Anxa2, Cryab, and Tmem47 (Yang et al, [163]2022)) relative to APC1 cells (state #12). We illustrate raw gene expression differences defining states #18 (Cilia) and #22 (Metaphase/Anaphase) in Fig. [164]2C,D. In Fig. [165]2E, we show how expression of the few established markers of precursor and intermediate cell states (Akdemir et al, [166]2020; Götz et al, [167]2015) varies across the 25 Stator states. Stator was also applied to a second subset ( [MATH: N=11,950 :MATH] ) of the E18 RPs, independent of the first, replicating APC1, APC2, IPC and RGC-like states, multiple G1/S and G2/M cell cycle phases’ states, and activated and blood contamination states (Fig. [168]EV2; Datasets EV[169]7, EV[170]8, and EV[171]9; Appendix Fig. S[172]2B). For more details, see “Stator state projection to disjoint data” and Fig. [173]EV2C,D for reproducibility of Stator states for RPs (and Fig. [174]EV5C,D for the neurons dataset). Figure EV2. Expanded view for Fig. [175]2. [176]Figure EV2 [177]Open in a new tab Disjoint RP dataset tested for reproducibility. (A) Stator identifies 27 states for a disjoint set of [MATH: N=11,950 :MATH] embryonic radial glial precursor-like cells (set RP2) at maximum modularity, annotated on the basis of d-tuple and s2o-DE genes as markers for cell types or cell cycle phases. (B) Dot plot illustrating differential expression of astrocytogenesis marker genes across all 27 Stator states. The size of the dots represents the - [MATH: log10(Seuratp-val-adj) :MATH] from differential gene expression testing between a state and all other states. Colour intensity reflects the [MATH: log2 (FC) :MATH] of gene expression. (C) The x axis shows the states obtained by running Stator on the RP2 dataset, while the y axis shows states obtained by running Stator on the dataset from Fig. [178]2 (set RP1). The RP1 states, obtained by running Stator on RP1 cells, are projected onto RP2 cells. The enrichment of these projected RP1 states in RP2 Stator states is computed using a hypergeometric test followed by the BH procedure to control the FDR at [MATH: 5% :MATH] . (D) As (C), but for RP1 cells. The y axis shows the RP1 states, while the x axis shows states obtained by running Stator on RP2 cells. The RP2 states are projected onto RP1 cells, and their enrichment in each RP1 cells’ state is computed using a hypergeometric test followed by the BH procedure to control the FDR at [MATH: 5% :MATH] . These panels demonstrate the reproducibility of Stator states on two disjoint sets of cells in the same biological condition. Figure EV5. Expanded view for Fig. [179]4. [180]Figure EV5 [181]Open in a new tab A second dataset of developmental neurons testing for reproducibility. Disjoint neurons dataset for reproducibility. (A) Stator identifies 29 states for a disjoint set of [MATH: N=19,000 :MATH] developmental neurons (set N2) at a Dice dissimilarity of 0.94, annotated when d-tuple and/or s2o-DEGs are marker genes for cell (sub)types or states. (B) Dot plot illustrating differential expression of neurogenesis marker genes across all Stator states. The size of the dots represents the - [MATH: log10(Seuratp-val-adj) :MATH] from differential expression testing between a state and all other states. Colour intensity represents the [MATH: log2 (FC) :MATH] of gene expression. (C) The x axis shows the states obtained by running Stator on the Fig. [182]4 dataset (set N1), while the y axis shows states obtained by running Stator on the dataset in panel A (set N2). The N1 states, obtained by running Stator on N1 cells, are projected onto N2 cells. The enrichment of these projected states in N2 Stator states is computed using a hypergeometric test followed by the BH procedure to control the FDR at [MATH: 5% :MATH] . (D) As (C), but for N1 cells. The y axis shows the N2 states, while the x axis shows states obtained by running Stator on N1 cells. The N2 states are projected onto N1 cells, and their enrichment in each N1 cells’ state is computed using a hypergeometric test followed by the BH procedure to control the FDR at [MATH: 5% :MATH] . These panels demonstrate reproducibility of Stator states on two disjoint sets of cells in the same biological condition. In addition to clustering, we compared Stator states for the mouse embryonic RPs with those obtained by three other methods, NMF, LDVAE and LDA; see “Comparison with other methods” for details. In summary, these methods consistently replicate multiple Stator states. However, there is greater expression specificity for Stator states over NMF, LDVAE or LDA states/modules, i.e., there is higher relative expression of known gene markers for the cell state as defined by Stator than the equivalent cell state defined by NMF, LDVAE or LDA (Table [183]1). Furthermore, these other methods lack uncertainty quantification for the reported gene modules, which can result in reported modules not being biologically identifiable (e.g., NMF results on embryonic RPs, Fig. [184]7A). Had these methods benefited from uncertainty quantification and FDR control, similar to Stator, then some reported modules may then be “statistically zero” which would avoid false positives and over-interpretation of results. We also compared Stator’s cell cycle states to Tricycle (Zheng et al, [185]2022b) analysis of this data (see Fig. [186]8 in “Comparison with other methods”). Table 1. The percentage of genes for each cell state label that show greater expression fold change in Stator over NMF, LDVAE and LDA, respectively. Stator vs NMF Stator vs LDVAE Stator vs LDA Activated 27% (18/67) 66% (44/67) 48% (32/67) Contamination 100% (4/4) 100% (4/4) 100% (4/4) Cilia 67% (2/3) 100%^a 100% (3/3) G1/S 86% (30/35) 97% (34/35) 97% (34/35) G2/M 57% (21/37) 78% (29/37) 92% (34/37) Neuron 75% (9/12) 92% (11/12) 100%^a RGC-like 31% (31/101) 100%^a 62% (63/101) APC2 0%(0/4) 50% (2/4) 100%^a [187]Open in a new tab ^aIndicates the state is not detected by LDVAE/LDA. Values more than 50% are in bold. Figure 7. Comparison with other methods. [188]Figure 7 [189]Open in a new tab (A) Heatmap presenting the -log10FDR from the hypergeometric test comparing the co-labelling of cells by Stator states and NMF states. To biologically annotate NMF states, marker genes were taken from 4 sources of literature: Dataset EV[190]2; IEGs ((Wu et al, [191]2017), as previously); cell cycle ((Tirosh et al, [192]2016), as previously); RGC-like marker genes at E17.5 ((Yuzwa et al, [193]2017) as previously). The biological labelling of NMF states has been performed exactly as before for Stator, although now more generously using 2 (rather than 3) or more marker genes for non-Stator methods. NMF recapitulates many of the Stator states. (B) For ciliated cells, Stator identified four ciliated cell subtypes (#18 ( [MATH: n=410 :MATH] ), #19 ( [MATH: n=25 :MATH] ), #20 ( [MATH: n=16 :MATH] ), #23 ( [MATH: n=25 :MATH] )), among which one (#18) shows higher expression of Foxj1, an ependymal cell marker gene. These four subtypes were not distinguished by NMF. In the boxplot, the median (middle quartile) marks the mid-point of the data and is shown by the line that divides the box into two parts, and the box itself indicates the range in which the middle 50% of all values lie. (C–E) Scatter plot of genes’ differential expression for NMF, LDVAE and LDA vs Stator states. Y axis: genes’ log2-fold expression change between cells in a Stator state and all other cells; X axis: genes’ log2-fold expression change between cells in a state or type identified by NMF/LDVAE/LDA and all other cells. Gene symbol colour reflects cell state or type, as annotated using literature marker genes (as above). Genes’ differential expression is highly significantly correlated between Stator and (C) NMF, (D) LDVAE or (E) LDA. However, Stator can detect cell states or types with mostly higher differential expression of known marker genes. In general, there is higher specificity for Stator in most categories over NMF, LDVAE and LDA (Table [194]1). Figure 8. Comparison of cell cycle states with other methods. [195]Figure 8 [196]Open in a new tab (A) Cell cycle phases predicted by Tricycle (left) and Seurat (right) for the embryonic RP dataset. Seurat requires all cells to belong to one of the three phases shown and appears to have lower resolution than Tricycle. (B) Stator’s 7 states recapitulate Tricycle’s 5 cell cycle phase predictions. Heatmap showing the -log10FDR value of the hypergeometric test comparing the overlap of cells from cell cycle relevant Stator states with cells from cell cycle phase identified by Tricycle. Cell cycle states in embryonic neurons and RPs We next showed that Stator can also identify cells in G1/S or G2/M phases within an admixture of two cell types, neurons and RPs ( [MATH: n=13,605 :MATH] and [MATH: 5395 :MATH] ), from a single E18 mouse brain (10XGenomics, [197]2017) ('Methods'). In all, Stator predicted 110 states from these combined cells (Datasets EV[198]10 and EV[199]2), of which 34 were specific ( [MATH: ≥99 :MATH] %) to neurons, and 19 to RPs. The remaining 57—common to both neurons and RPs—annotate cells that are dispersed in whole transcriptome space. The median number of predicted states for a cell was 3 (Appendix Fig. S[200]2C). Stator does not rely on Euclidean distances, and thus does not require the first-stage dimensionality reduction to [MATH: ≤50 :MATH] dimensions to avoid the curse of dimensionality, nor does it require the 2D visualisation in the second step. Moreover, because Stator does not rely on the proximity of cells in expression space, it permits different sub-populations to co-exist in the same biological state. For example, cells of different type, here e.g., RPs and early neurons, can exist in the same biological state, e.g., G2/M phases of the cell cycle. If the proximity of cells in expression space is influenced most by cell type, then states attained by multiple cell types will often be missed. For example, Fig. [201]3A presents a heterogeneous dataset containing a combination of neurons (predominantly left) and RPs (right), for which PC1 explains [MATH: ~10% :MATH] of the total variation, with the remaining PCs explaining [MATH: <2% :MATH] each. Yet, each of these cell types clearly includes some cells occupying the same cell cycle state, e.g., G2/M. Stator readily detects such states from homogeneous cells or the combination of two cell types. Figure 3. Stator identifies states present in two different cell types. [202]Figure 3 [203]Open in a new tab (A) Stator states labelling both developmental neurons and radial glial cell precursors (RPs) from an E18 mouse brain dataset (10XGenomics, [204]2017). Stator identifies 12 states that include one or more cell cycle phase gene markers as d-tuple genes, and that delocalise throughout expression space. PC1 explains [MATH: ~10% :MATH] of the variation. State #51’s d-tuple genes (Lig1+, Mcm3+, Smc2+) encode proteins active in S phase, while the remaining 11 express known markers of G2/M phase (Giotti et al, [205]2018; Riba et al, [206]2022; Tirosh et al, [207]2016). It is notable that many G2/M states are defined by an absence of expression of d-tuple genes that are nevertheless known G2/M marker genes (Cenpa, Cks2, Cenpf, Racgap1, Cdc25c, Ube2c and Cdca8). (B) Left: Externally-derived cell cycle annotations of a mouse brain dataset sourced from five different experiments (Schmitz et al, [208]2022), with cells from the 10XGenomics E18 mouse dataset [209]GSE93421 removed, mostly separate along principal components 1 and 4. Right: Cells and embedding as left, but marked by the expression of all but one of the marker genes. Different intensities of blue represent ‘densities’ of cells in the 2D embedding. Note that since a two-dimensional PCA embedding can distort distances, densities cannot be directly interpreted, so no legend or axis is shown. Each state contains the expression of known cell cycle markers as well as a single non-expressed cell cycle marker gene (indicated above each box), predicted to be a combinatorial marker by Stator in RPs and/or neurons. These gene combinations thus demarcate cell cycle sub-phases, and a suggested ordering is shown here. Among 12 cell cycle Stator predictions were G1/S (#51), S/G2 (#55), early G2/M (#58) and late G2/M (#59) states (Figs. [210]3 and [211]EV3; Datasets EV[212]10–[213]12). RP cells in G1/S cell cycle phases (States #51 and #55) had not previously been detected in this embryonic stage by cell cycle classification (Yuzwa et al, [214]2017). An additional state (#57) involving cells that were predominantly labelled neurons (86%) showed multiple s2s-DEG markers for both newborn neurons (e.g., Dcx, Tubb3, Gad2, Stmn2) and G2/M phases (e.g., Cenpa and Cdca3) and hence is likely a post-G0 phase neuron state (Datasets EV[215]10–[216]12). Figure EV3. Expanded view for Fig. [217]3. [218]Figure EV3 [219]Open in a new tab s2s-DEG analysis for 5 cell cycle states. Numbers of cells labelled with any one of 5 cell cycle states (#57, 51, 55, 58, and 59) in embryonic RPs and neurons; areas of circles are proportional to their number (see legend). Filled circles indicate numbers of cells labelled with only of these single cell cycle states. Grey circles’ areas indicate numbers of cells labelled with two cell cycle states, those indicated by lines. Numbers of significantly differentially expressed genes between cell cycle state pairs (i.e., s2s-DEGs) are provided between the two states being compared; their colours refer to the state showing higher expression. For clarity, state pairs with [MATH: ≥ :MATH] 25 cells are shown. DEGs between any two states, including state pairs with fewer than 25 co-labelled cells, are provided in Dataset EV[220]12. Appendix Fig. S[221]8B additionally provides the number of co-labelled cells between any two states. Right: s2s-DEGs are indicated by “ [MATH: > :MATH] “ or “ [MATH: < :MATH] “ symbols; for example, Hells mRNA expression is significantly higher in State #51 over States #57, 55, 58 and 59. Early/late G1/S or G2/M cell cycle phase labels (top) were assigned using these mRNAs’ cell cycle phases known from high-throughput (top right; (Giotti et al, [222]2018)) and targeted experiments (Ung mRNA in late G1/S (Slupphaug et al, [223]1991) and Cenpa in G2 (Shelby et al, [224]1997)). Eight G2/M states contained minus gene markers (i.e., those without expression evidence) that are nonetheless known markers of G2/M phases: Cenpa for States #1 and #2, Cks2 #53 and #54, Cenpf #56, Racgap1 #58, Ube2c #60 or Cdca8 #61. We found similar combinatorial markers for the cell cycle in RPs, which included S-phase states without expression of one of three S-phase markers, Ung, Mcm3, Gins2. To investigate whether these states demarcate portions (sub-phases) of G2/M cell cycle phases, we highlighted cells from an external dataset along a cell cycle projection that expressed all but one G2/M or S-phase cell cycle marker genes, specifically the minus gene marker (Figs. [225]3 and [226]EV3). This illustrated that cells in these Stator states differentially occupied parts of the cell cycle continuum, consistent with cell cycle sub-phases. For example, Stator states differentiated between early G2/M (d-tuples with an absence of Ube2c or Cenpa expression, i.e., Ube2c-, or Cenpa-), early- or mid-G2/M (Cdc25c- or Racgap1-), or mid-to-late G2/M (Cks2- or Cdca8- or Cenpf-) or early S-phase (Gins2-), mid-S-phase (Mcm3-) and late S-phase (Ung-) (Fig. [227]3B). Rather than single genes, it is the combinatorial gene expression pattern that provides high resolution of cell states. This is because populations of cells defined only by the expression of various combinations of cell cycle marker genes, without requiring that the minus gene is unexpressed, are not localised to a cell cycle (sub)-phase (Fig. [228]EV4). Figure EV4. Expanded view for Fig. [229]3. [230]Figure EV4 [231]Open in a new tab Minus gene expression is required to specify cell cycle sub-phases. Left: External cell cycle annotations of a merged mouse brain dataset sourced from five different experiments (Schmitz et al, [232]2022), with the cells from dataset [233]GSE93421 removed. Right: In contrast to Fig. [234]3B, highlighting cells (in blue) based only on the expression of all but one of the marker genes, leaving the gene indicated above each plot (the ‘minus’ gene) unrestricted, results in more diffuse cell cycle specificity, and simply marks cells in G1/S or G2/M phases. Having successfully identified cell cycle sub-phases for RPs and for a combined RP and neuron dataset, we next used Stator to identify additional cell states within the combined dataset. Embryonic RPs were previously described as homogeneous at E17.5 (Yuzwa et al, [235]2017). By contrast, 47 Stator states could be labelled as RPs either because their d-tuple genes were embryonic RP markers (Yuzwa et al, [236]2017) or else they significantly more highly expressed such genes over all other states (Datasets EV[237]10, [238]EV11 and EV[239]2). This number of statistically distinct RP states is an order of magnitude greater than predicted in a previous study of neurogenesis (Shin et al, [240]2015). Of these 47 RP states, 21 were transcriptionally heterogeneous owing to their d-tuples including a minus gene marker, such as Hes5 (states #26-27), Qk (#29), and Pax6 (#34), each of which is involved in neural progenitor cell fate choice ((Ericson et al, [241]1997); Imayoshi and Kageyama, [242]2014; Takeuchi et al, [243]2020). These RP states were transcriptionally heterogeneous (Appendix Fig. S[244]9; Dataset EV[245]13): (i) 13 RP states yielded large number of s2s-DEGs, compared with the most populous RP state (#44); (ii) 3 states (#13, #36 and #39) showed significantly lower expression of 7 core RP genes, Mt3, Phgdh, Slc1a3, Ddah1, Aldoc, Vim, and Fabp7 (Yuzwa et al, [246]2017) than state #44; (iii) 15 states contained G2/M cell cycle phases’ marker genes among their s2s-DEGs relative to state #40; (iv) and 15 states yielded large ribosomal subunit genes as s2s-DEGs with state #40, a transcriptional signature of embryonic RP reactivation to become activated neural stem cells (Borrett et al, [247]2022; Dulken et al, [248]2017). Thirty-four RP states had neuronal marker genes among their s2s-DEGs with states #40 or #44 (Datasets EV[249]12 and EV[250]13), consistent with these embryonic RPs having a future neuronal fate. Seventeen states co-express Ascl1 and Neurog2 (often with Gadd45g, a transcriptional target of ASCL1), two genes that are expressed in more mature cells in a mutually exclusive manner (Parras et al, [251]2002). These states thus likely label early neural progenitor cells that have yet to attain their GABAergic (Ascl1) or glutamatergic (Neurog2) neuronal fate in the forebrain. Finally, we projected Stator RNA states from the E18 merged RPs and neuron dataset, into an independent scRNA-seq dataset (10XGenomics, [252]2021) of 5000 cells (3343 cells after quality control) acquired in the same biological condition that has an additional modality, namely scATAC-seq, to investigate the heterogeneity of states using an orthogonal mode of data in a disjoint dataset. Transcriptomic heterogeneity was retained across the two datasets and was additionally recapitulated by open chromatin status (see Supplementary section “Comparison with multimodal data” Appendix Fig. S[253]10). Neuronal states For our final analysis of embryonic mouse brain cells, we analysed two disjoint subsets each containing 19,000 mouse E18 neurons. As the modularity was maximised at Dice dissimilarity of 0.97 and 0.91, respectively, we applied a mean similarity of 0.94, resulting in 29 states in each (Datasets EV[254]14 and EV[255]15), allowing us to compare the disjoint subsets at equivalent resolution. The number of predicted states per cell is presented in Appendix Fig. S[256]2D,[257]E. This number of Stator states was five-fold more than the four pairwise significantly distinct clusters found by hierarchical clustering in expression space for the first disjoint dataset (Appendix Fig. S[258]11). Stator successfully distinguished striatal medium spiny neurons (MSN) from interneurons by expression of known marker genes (e.g., Ngef, Nrxn1, Pou3f1, Tshz2 (Arlotta et al, [259]2008; Fuccillo et al, [260]2015; Su-Feher et al, [261]2022), versus Arx, Epha5, Lhx6, Prox1 (Li et al, [262]2022; Miyoshi et al, [263]2015; Poirier et al, [264]2004), Fig. [265]4A; Datasets EV[266]2, EV[267]16 and [268]17). It further separated MSNs into their two known subtypes, Direct or Indirect pathway cells (Cirnaru et al, [269]2021; Cui et al, [270]2013) via markers: Direct: Ebf1, Foxp1, Isl1, Nrxn1, Zfhx3 and Zfp503 (Fuccillo et al, [271]2015; Li et al, [272]2022; Precious et al, [273]2016; Shang et al, [274]2022; Zhang et al, [275]2019) and Indirect: Adora2, Ebf1, Gucy1a3 and Gucy1b3 (Li et al, [276]2022) (Fig. [277]4B), and separated interneurons into Htr3a and/or Npy expressing subtypes (Tremblay et al, [278]2016) (Datasets EV[279]2, EV[280]16 and [281]17). Three RP-like states were additionally detected (Fig. [282]4C). States could be further labelled as early or late via markers of neuronal maturation (Rubenstein et al, [283]2020), specifically the temporal sequence of expression of Dlx2, Dlx1, Dlx6os1 and Dlx6 genes, and the later expression of MSN or interneuron markers (Fig. [284]4D) (Liu et al, [285]1997). Figure 4. Stator identifies states for developmental neurons. [286]Figure 4 [287]Open in a new tab (A) Stator states for [MATH: 19,000 :MATH] E18 mouse cells, previously annotated as neurons. States were labelled by matching s2s- and s2o-DEGs with literature gene markers, as before (Fig. [288]2). The barplot colours indicate proportion of cells captured by each cluster following hierarchical clustering of cells (see Appendix Fig. S[289]11), resulting in four significantly different clusters. On the right-hand side of panel A, three representative Stator states (# 8, 13, 21) are highlighted in a PCA embedding of the unbinarised expression data, and annotated with their total number of cells and d-tuples, as well as the five most common individual gene states across the states’ d-tuples. IC intercalated cells of amygdala (Erbb4+, Tshz1+, Foxp2+, Pbx3+ (Kuerbitz et al, [290]2018; Peters et al, [291]2023), IN interneurons, Late late-born neurons, MNP migratory neuronal precursors (Vax1+, Shtn1+, Pcdh9+, Tiam2+ (Asahina et al, [292]2012; Coré et al, [293]2020; Kawauchi et al, [294]2003; Sapir et al, [295]2013), MSN medium spiny neurons, NPC neural precursor cells, RP radial glial cell precursors. (B) Heatmap of expression level (z-score) for genes upregulated in state #13: Mid Direct MSN, versus other states, for cells in state #13 and a random selection of [MATH: n=500 :MATH] cells from other groups. (C) Heatmap of expression level (z-score) for genes upregulated in state #24, versus other states, for cells in state #24 and a random selection of [MATH: n=500 :MATH] cells from other groups. (D) Dot plot illustrating differential expression of neurogenesis marker genes (Rubenstein et al, [296]2020) across all Stator states. The size of the dots represents the - [MATH: log10(Seuratp-val-adj) :MATH] from differential expression testing between a state and all other states. Colour intensity represents the [MATH: log2 (FC) :MATH] of gene expression. Increasing the resolution of Stator state identification can resolve multiple constituent biological states. At a Dice dissimilarity of 0.94, Stator’s state #26 labelled neural precursor cells, as evidenced by high expression of Zeb2, Mdk, Ctnna1, Arx and Prox1. Nevertheless, this state was found to be a composite of three component sub-states largely following the branching order of co-occurring d-tuples (see Fig. [297]1D,[298]E; 'Methods'). From their d-tuple genes, these sub-states are readily distinguished as labelling G2/M cell cycle phases, neural stem cells and newborn neuronal precursors, respectively (Appendix Fig. S[299]12). Stator states representing the same neuronal subtypes (e.g., interneurons, direct or indirect MSNs and late-born neurons) for the second disjoint dataset are shown in Fig. [300]EV5 and Datasets EV[301]18 and EV[302]19. Stator resolves cell (sub)types in human liver disease at higher resolution To demonstrate application of Stator in a human disease context, we analysed [MATH: 20,000 :MATH] cells from patients with uninjured or cirrhotic livers. These cells had previously been annotated as one of 12 types (Ramachandran et al, [303]2019). Stator identified 53 states (Dataset EV[304]20), 28 that were differentially enriched between cirrhotic and uninjured liver sample cells (Fig. [305]5A). Enrichment of these states showed that Stator retrieved previous cell-type annotations, yet also found multiple states for each previous annotation (Fig. [306]5B). For example, cells previously annotated as being endothelial are uniquely enriched in 7 states (#4–6, #23, #32–34; green box in Fig. [307]5B). To cross-reference the same states in panels A and B we use an alluvial plot. Rather than calculating enrichments for disease status (panel A) or cell-type annotations (panel B) separately, Stator also can perform an enrichment analysis for cells with Stator state labels with both previous cell-type and disease/uninjured status annotations (panel C). This shows, for example, that whereas state #4 is enriched among cirrhotic sample cells (Fig. [308]5A) and among annotated endothelial cells (ECs) (Fig. [309]5B), it is enriched not just in cirrhotic but also uninjured ECs (Fig. [310]5C). Equivalents to Stator states #5, #33 and #34 were found by (Ramachandran et al, [311]2019) (i.e., Endo(2), Endo(7) and Endo(1)) and then validated by cell staining, flow cytometry and/or immunofluorescence. Figure 5. Stator states in cirrhotic and healthy human liver cells previously annotated by (Ramachandran et al, [312]2019). [313]Figure 5 [314]Open in a new tab (A) States (columns) enriched in single cells from cirrhotic or healthy liver samples (rows). (B) Heatmap showing states significantly enriched in these cells’ previous annotations (indicated by asterisks). Seven states (#4, #5, #6, #23, #32, #33, and #34) are significantly enriched only in the endothelial cell (EC) annotation (green box). This panel implies that Hepatocytes, Mesothelia and pDCs do not correspond to any Stator state that is exclusively enriched for these cell types. Nevertheless, this is due to the conservative thresholds applied here. Expected correspondences emerge when thresholds are further relaxed (Dice dissimilarity [MATH: >0.5 :MATH] , log2FC [MATH: >1 :MATH] ) where there are 6, 2 and 2 Stator states that are exclusively enriched for Hepatocyte, Mesothelia and pDC annotations, respectively (see Appendix Fig. S[315]13). (C) States significantly enriched in both cirrhotic/uninjured status and a previous cell-type annotation (indicated by asterisks). (D) Virtually all cells with previous EC annotations are labelled with just one of the 7 EC-specific cell states. These states were not detected by the original study (Ramachandran et al, [316]2019) or differential abundance analysis by Milo (Dann et al, [317]2022). (E) Numbers of cells labelled with EC states (#4, #5, #6, #23, #32, #33 and #34); areas of circles are proportional to their number (see panel legend). For states #5 to #34, numbers of significantly differentially expressed genes between cell cycle state pairs (i.e., s2s-DEGs) are indicated relative to state #4; colours refer to the state showing higher expression. Coloured numbers indicate significantly differentially expressed genes in cells labelled with state #4 compared to cells in any other EC state (i.e., s2o-DEGs); numbers of significantly differentially expressed genes between state #4 and all other EC states (increased and decreased expression) are shown in grey. Colour-coded marker genes used to annotate cell states are provided adjacent to each state’s circle; a red box contains three genes whose expression is decreased in state #4 relative to the other EC states. (F) Heatmap of expression levels (z-scores) for marker genes used to annotate each EC state. Genes are grouped and colour-coded by their associated annotation from the literature (Dataset EV[318]2). The five categories of gene markers are colour-coded as indicated in the panel legend. Cells (columns) are enclosed within a coloured box designating the EC state labelling that cell. None of the seven EC-labelled Stator states co-occur in five or more cells (Fig. [319]5D) suggesting that each of the seven represents a distinctive EC subtype. Cross-referencing these states’ s2s-DEGs (Dataset EV[320]21) to literature EC gene markers (Przysinda et al, [321]2020; Trimm and Red-Horse, [322]2023) identified state #33 as the ACKR1, WNT2, COL3A1 and COL6A2-expressing immunomodulatory subpopulation which is specific to the fibrotic niche (Ramachandran et al, [323]2019) (Fig. [324]5E). In addition, state #5 was identified as PDPN, FOXC2, and PROX2-expressing lymphatic-specific ECs; state #6 as a subpopulation expressing PLAT, whose protein level is increased in patients with liver disease (Leiper et al, [325]1994); and, states #23, #32 and #34 as liver-specific liver sinusoidal ECs (LSECs). The most populous state (#4) labels ECs that are not tissue or organ-specific (Dataset EV[326]22). Differential expression of these EC subtype marker genes across these ECs is illustrated in Fig. [327]5F. In summary, Stator has labelled cell subtypes from among a previously homogeneous set of ECs that were scarce in this dataset ( [MATH: < :MATH] 2.5%), demonstrating its identification of rare disease-specific subtypes. For comparison, NMF analysis was performed on the liver cirrhosis dataset from (Ramachandran et al, [328]2019) using the NMF procedure in (Barkley et al, [329]2022), described in the Methods section 'NMF procedure and gene modules'. Of the 25 NMF modules identified, 5 were significantly enriched in endothelial cells. Of these five modules, only one could be annotated based on [MATH: ≥2 :MATH] marker genes used in the original submission (Przysinda et al, [330]2020; Trimm and Red-Horse, [331]2023): module [MATH: mFCN3< /mn> :MATH] genes included CLEC4G and STAB2, both markers for liver sinusoidal endothelial cells. The other four NMF modules could only be labelled as endothelial cells of unknown type. By comparison, using the same data, seven Stator states could be labelled with [MATH: ≥3 :MATH] marker genes from (Przysinda et al, [332]2020; Trimm and Red-Horse, [333]2023) (Fig. [334]5E). Stator recapitulates cancer cell types and NMF state annotations, yet at a higher resolution Finally, we applied Stator to a cancer (hepatocellular carcinoma (HCC)) dataset. Stator’s cancer cell states were then compared against two sets of annotations that were defined previously in (Barkley et al, [335]2022) by (i) clustering (Stuart et al, [336]2019) and comparison against reference datasets using SingleR (Aran et al, [337]2019), or (ii) non-smooth, non-negative matrix factorisation (nsNMF; Pascual-Montano et al, [338]2006). For this analysis, 51 Stator states were predicted from 14,698 cells derived from 4 patients’ hepatocellular carcinoma (HCC) samples (Barkley et al, [339]2022) (Dataset EV[340]23). These states were enriched for 11 of 12 cell types previously annotated using clustering and SingleR (Barkley et al, [341]2022) (Fig. [342]6A); the exception, epithelial cells, were low in number ( [MATH: n=21 :MATH] ). As before, Stator resolved single-cell types into multiple subtypes, for example, a single B-cell annotation into 13 sub-states. Myeloid lineage (macrophages, dendritic cells [DC] and neutrophils) states and lymphoid lineage (T cells, natural killer [NK] cells and B cells) states were distinct, highlighted in Fig. [343]6A by blue and red boxes respectively. Stator states were often easily annotated by their d-tuple genes. For example, state #43’s d-tuple genes contained CD4 and other T-cell markers; the myeloid lineage state #32 [PLBD1+, SPI1+, LYZ+, MS4A6A+] is in part defined by MS4A6A, a known marker for neutrophils, macrophage and dendritic cells (Franzén et al, [344]2019a); and the lymphoid lineage state #48 [IGHG4+, IGKC+, FGFBP2+, IGHG1+] is largely defined by immunoglobulin genes, known markers for terminally differentiated B cells, i.e., plasma cells (MacParland et al, [345]2018). Figure 6. Stator identifies HCC cell types and states at higher resolution than other methods. [346]Figure 6 [347]Open in a new tab (A) Heatmap showing significant enrichment (asterisks) among 51 Stator states with 12 cell-type annotations previously defined by clustering followed by SingleR annotation (Aran et al, [348]2019; Barkley et al, [349]2022; Stuart et al, [350]2019). Stator identified multiple sub-populations for previously identified single-cell types; for example, 12 Stator states occur unusually often only among cells previously annotated as hepatocytes. (B) Since Stator allows for cells to acquire multiple states, hepatocyte states can co-label single cells. Nevertheless, most cells are labelled only as single Stator states. (C) Numbers of s2s-DEGs and their mean log2-fold change between Stator states enriched in cells previously annotated as hepatocytes. The 12 hepatocyte-enriched states are transcriptionally distinguishable. (D) Statistically significant enrichment (notified by asterisks) of Stator states (Y axis) in cells previously annotated (Barkley et al, [351]2022; Gaujoux and Seoighe, [352]2010; Puram et al, [353]2017) into 16 NMF-defined states (X axis). (E) Two Stator states are differentiated by genes that are predictive of liver cancer patient survival. Mean s2s-DEG expression fold change (X axis) for state #7 over #37 plotted against the percentage of 5-year survival (Y axis) for TCGA patients whose expression of this gene lies above a pre-determined threshold (Uhlén et al, [354]2017). The most populous state, #45, labels C1Q+ macrophages, an immunosuppressive population (Revel et al, [355]2022), annotated because 24 of 25 gene markers for these macrophages (cluster 10 of (Sharma et al, [356]2020)) are s2s-DEGs relative to state #40 (Dataset EV[357]24). These are tissue-resident, rather than tumour-associated, C1Q+ macrophages because state #45 cells significantly more highly express FOLR2, rather than TREM2, relative to state #40 (Revel et al, [358]2022). Twelve Stator states were enriched among cells labelled previously as hepatocytes (Barkley et al, [359]2022). These states labelled largely distinct sets of cells (Fig. [360]6B) that are transcriptionally distinguishable, as evidenced for example by large numbers of s2s-DEGs (Fig. [361]6C). A large minority (8–23%) of these states’ s2s-DEGs are not expressed in normal hepatocytes ('Methods'), thereby reflecting their transformed status. The 12 transformed hepatocyte states showed considerable cell cycle gene expression heterogeneity. For example, State #7 expressed 6 cell cycle genes (BIRC5, CCNA2, CCNB2, CDK1, TOP2A and UBE2C) significantly more highly than the most populous State #37 (Dataset EV[362]25). Other states (#17, 18, 19, 38) showed lower expression of these genes. These six cell cycle genes are rarely expressed in normal liver samples (Andrews et al, [363]2022) (Methods) and each gene’s high expression is known to be prognostic of worse outcome in liver cancer (Uhlén et al, [364]2017). In the previously published analysis, these HCC cells were annotated both by type and state (Barkley et al, [365]2022). The enrichment of Stator states in NMF states is presented in Fig. [366]6D. To investigate whether Stator could resolve these cells more finely, we analysed only those with both ‘Hepatocytes’ and ‘Cycle’ annotations, finding them to be enriched in 7 Stator states, most frequently in #37 (44.9% of 1447 cells) and/or #7 (40.8% of 1447 cells) (Fig. [367]EV6). Despite their previous identical annotation, cells in these 2 Stator states are transcriptionally divergent, with 78 s2s-DEGs separating them (Dataset EV[368]26). Cells in state #37 had increased expression of transcripts that are abundant in normal hepatocytes (34 of 34 s2s-DEGs e.g., AHSG, PLA2G2A, CYP2E1 and HPD) whereas cells in state #7 had increased expression of genes that are rarely or never expressed in normal hepatocytes (13 of 44 s2s-DEGs, e.g., TFF1, TFF2, TFF3 and NDUFA4L2). This suggests that Stator state #7 cells are in a more advanced state of cellular transformation than #37 cells. Figure EV6. Expanded view for Fig. [369]6. [370]Figure EV6 [371]Open in a new tab Stator states (columns) that are significantly enriched in HCC cells previously doubly-annotated by cell type (right) and cell state (left) inferred using singleR and NMF, respectively. To test this hypothesis, we used TCGA liver cancer prognosis data (Uhlén et al, [372]2015, [373]2017), plotting s2s-DEGs’ mean expression fold change (state #7 over #37, this study; Fig. [374]6E, X axis) against the 5-year percentage survival rate (Y axis) for TCGA patients whose expression of this gene is above a pre-determined threshold (Y axis, (Uhlén et al, [375]2017)). This showed that genes that are more highly expressed in cells in state #7 over #37 tend to be those genes that are more highly expressed, at diagnosis, in liver cancer samples of patients with lower survival rates. Conversely, genes that are more highly expressed in state #37 over #7 tend to be genes that are more highly expressed in liver samples of patients with higher survival rates. In summary, Stator has revealed previously unappreciated HCC cancer states whose differential expression involves genes that are predictive of patient survival. Discussion Single-cell transcriptomics is being translated into clinical practice for biomarkers of disease progression, patient stratification and antitumour treatment (Jia et al, [376]2022; Lim et al, [377]2023; Van de Sande et al, [378]2023). Concurrently, cell fate trajectories are being predicted for multiple cell types and subtypes across development (Imaz-Rosshandler et al, [379]2024). Nevertheless, virtually all such studies project high-dimensional single-cell transcriptome data into two or three dimensions, which distorts both clusters and developmental trajectories (Chari and Pachter, [380]2023). Cell localisation in expression space presents an additional problem: how to label a single cell with multiple labels (type, subtype, cell cycle phase, maturity and activity (Kotliar et al, [381]2019)) which are non-localised in expression space. Stator’s alternative approach circumvents cell clustering and this distortion by identifying 3–7 genes with unexpectedly coordinated expression (or non-expression) across single cells. Stator’s further advance is the identification of cell (sub)types and states at substantially higher resolution than existing methods. Stator results show that a wealth of biological information can be inferred from the higher-order statistics of single-cell expression data. Evidence exists for higher-order and combinatorial genetic interactions (Antebi et al, [382]2017; Arnosti et al, [383]1996; Kuzmin et al, [384]2018; Watkinson et al, [385]2009) and pairwise quantities at different pseudotimes have been investigated (Ghazanfar et al, [386]2020). Nevertheless, the biological value of higher-order statistics in single-cell gene expression has not previously been shown. The picture we have seen emerging from applying Stator is of cells adopting a spectrum of states (or colours, in this metaphor) with their primary colour representing their strongest transcriptomic signature, most often indicating cell type. Differential expression between cells of the same type, but in two different states, filters out their primary colour thereby revealing secondary colours, representing cellular dynamics differences. This metaphor can be continued with respect to tertiary and quaternary colours, representing even more finely resolved aspects of cell state. Finer resolution of cell state will enhance understanding of state transitions in cancer, in ageing and age-related diseases, and during development (Barkley et al, [387]2022; Griffiths et al, [388]2018; Traxler et al, [389]2023). Our analysis of HCC cells, for example, uncovered a cancer state predictive of patient survival (Fig. [390]6E). Further, our results more finely resolve neural stem and progenitor cell types that can now be investigated using mouse models. For example, 29 of 43 minus genes that partly define 110 neuronal and/or RP states (see above), have morphology or behaviour phenotypes when disrupted in mice (Dataset EV[391]27; Blake et al, [392]2020). The roles of these genes in specifying cell state transitions during neurogenesis and in neurological disease, can now be investigated at greater cellular and developmental resolution. Gao et al ([393]2022) recently solved the issue of selective inference bias, or double-dipping, specifically when cells are clustered by optimising their transcriptional differences before calculating their transcriptional differences. Each of these two operations occurs on gene expression space. Stator clusters not cells, but rather d-tuple gene signatures, prior to s2s-DEG analysis. Even if present, Stator will mitigate selective inference, at least in part, by differences not being maximised on the same space, and by demanding significant s2s-DEGs to not just be d-tuple genes, when states are declared to be transcriptomically non-identical. The Gao et al method is also not immediately applicable here due to its reliance on clustering algorithms that compute Euclidean distances, whereas Stator relies on Dice dissimilarity. Due to current computational constraints, Stator is limited to approximately [MATH: 1000 :MATH] HVG and [MATH: 40,000 :MATH] cells to estimate higher-order [MATH: n :MATH] -point interactions ( [MATH: n=2,3,…< mo>,7 :MATH] ). Estimation of conditional dependencies contributes most to computational cost, so Stator’s efficiency and accuracy could be greatly improved as new causal discovery methods are developed. In addition, accuracy could be improved by integrating biological knowledge into the dependency graphs. The limitation of up to 7-point interactions is statistical rather than computational: we did not find evidence for significant 7-point interactions in the datasets analysed. Stator takes advantage of sparse gene-by-cell matrices, and so is not intended for analysing deep coverage transcriptomes until more sophisticated binarisation schemes are explored (e.g., Li and Quon, [394]2019). Other challenges relate to how Stator predictions should be interpreted, particularly those states lying on a continuum whose biology is poorly understood. Further, the resolution (i.e., Dice dissimilarity) at which states should be defined and can be interpreted will vary by dataset. Lastly, conditioning on absent gene expression in the Markov blanket (Eq. ([395]1), 'Methods') may overlook some states despite large numbers of biologically plausible states being returned. Stator can be applied to a variety of scRNA-seq datasets in biomedicine, including those with a temporal label (e.g., developmental or disease progression), as well as data from different individuals, to compare and contrast cell states of individuals with different disease progression trajectories, or responders and non-responders to therapy. Finally, Stator’s general methodology can also be applied to other datasets with variables that are binary or can be approximated well by binarisation, such as disease comorbidities, scATAC-seq, or sparse single-cell proteomics. Methods Reagents and tools table Software Stator nextflow pipeline [396]https://github.com/AJnsm/Stator/tree/main pclag [397]https://cran.r-project.org/web/packages/pcalg/index.html BiDAG [398]https://cran.r-project.org/web/packages/BiDAG/index.html Scanpy [399]https://github.com/scverse/scanpy Stator R shiny app code [400]https://github.com/YuelinYao/MFIs Stator R shiny app server [401]https://shiny.igc.ed.ac.uk/MFIs/ Complete list of packages used within Stator R shiny app (with version number) [402]https://github.com/YuelinYao/MFIs/blob/main/renv.lock Seurat (v4.3.0) [403]https://github.com/satijalab/seurat/releases Clusterpval [404]https://lucylgao.github.io/clusterpval/ NMF [405]https://cran.r-project.org/web/packages/NMF/index.html scVI [406]https://github.com/YosefLab/scVI/ LDVAE [407]https://docs.scvi-tools.org/en/stable/user_guide/models/amortizedl da.html LDA [408]https://docs.scvi-tools.org/en/stable/user_guide/models/amortizedl da.html Milo [409]https://github.com/MarioniLab/miloR Tricycle [410]https://bioconductor.org/packages/release/bioc/html/tricycle.html [411]Open in a new tab n-point interaction estimation In previous work, we developed a model-independent estimator of higher-order interactions amongst binary variables (Beentjes and Khamseh, [412]2020). Here, we refer to the multiplicative interaction in (Beentjes and Khamseh, [413]2020) as Model-Free Interaction (MFI) due to its definition being without reference to any subjective parametric model, but in terms of probabilities and their expectation values. Similar notions (for 2-point interactions) have been proposed in the statistics literature (Hernan and Robins, [414]2023; VanderWeele and Knol, [415]2014). For completeness, we summarise the main definitions and interpretations of MFIs (Beentjes and Khamseh, [416]2020) below. A 2-point MFI is defined, and can be rewritten, as follows: [MATH: IG< mi>i,Gj=logp(Gi=1,Gj= 1∣G_=0)< /mo>p(Gi=0,Gj=0∣G_=0)< /mo>p(Gi=0,Gj=1∣G_=0)< /mo>p(Gi=1,Gj=0∣G_=0)< /mo>=logp(Gi=1,Gj= 1∣G_=0)< /mo>p(Gi=0,Gj=1∣G_=0)< /mo>−logp(Gi=1,Gj= 0∣G_=0)< /mo>p(Gi=0,Gj=0∣G_=0)< /mo>=logE[Gi∣G j=1,G_=0]< /mo>E[Gi∣G j=0,G_=0]< /mo>1−E[Gi∣G j=0,G_=0]< /mo>1−E[Gi∣G j=1,G_=0]< /mo>, :MATH] 1 where [MATH: G_ :MATH] is the set of all other genes, aside from [MATH: Gi :MATH] and [MATH: Gj :MATH] , that are not independent of [MATH: Gi :MATH] and [MATH: Gj :MATH] . The first line in Eq. ([417]1) has the interpretation of a generalised conditional log-odds ratio and is symmetric in [MATH: Gi :MATH] and [MATH: Gj :MATH] . The second line provides the following interpretation: “Does the likelihood of gene [MATH: Gi :MATH] ’s expression being on vs off depend on the status of gene [MATH: Gj :MATH] ’s expression?”. To elaborate further, the first term represents the likelihood of gene [MATH: Gi :MATH] being on vs off, whilst gene [MATH: Gj :MATH] is on, while the second term represented the same quantity with gene [MATH: Gj :MATH] is off. If the expression values of the two genes [MATH: Gi :MATH] and [MATH: Gj :MATH] are completely independent of each other, then these two terms cancel and result in a zero interaction as desired. The third line represents the same quantity in terms of expectation values, which are then taken as averages over the data for estimating the interactions. Uncertainties in these estimates are quantified via the bootstrap procedure (Efron, [418]1979). In (Beentjes and Khamseh, [419]2020) we generalised this definition and estimator to [MATH: n :MATH] -point interactions. For example, a 3-point interaction, where [MATH: p(G i,j,k=< mn>1,1,1) :MATH] is shorthand for [MATH: p(G i=1,Gj=1,Gk=1 ∣G_=0)< /mo> :MATH] and so on, is defined as follows: [MATH: IG< mi>i,Gj,Gk =logp(Gi,j,k=1,1, 1)p(Gi,j,k =1,0,0)p(< mrow>Gi,j,k=0,1, 0)p(Gi,j,< mi>k=0,0,1)p(< /mo>Gi,j< /mi>,k=1,1,0)p (Gi, j,k=1,< mn>0,1),p(Gi,j,k=0,1,1)p< /mi>(Gi ,j,k=0,0,0), :MATH] 2 and has the interpretation of whether the expression status of a third gene, [MATH: Gk :MATH] , changes the 2-point interaction between [MATH: Gi :MATH] and [MATH: Gj :MATH] expression. We presented previously (Beentjes and Khamseh, [420]2020) that this definition recovers, in a data-driven manner, known ground truth interactions in statistical physics systems such as the Ising model, and more generally energy-based models, as well as any other Markovian complex system. We further demonstrated that our MFI definition, used to directly estimate the interaction, results in the same estimate as when training a Restricted Boltzmann Machine, both analytically and numerically within statistics. The advantage of the MFI direct estimation on binary data is its model-independent definition interpretability and its avoidance of having to fit the joint probability distribution amongst the variables. The latter is a much more complex quantity to estimate robustly than the combination of expectation values in the MFI estimator. Finally, we note that conditioning on [MATH: G_=0 :MATH] in Eq. ([421]1) is equivalent to finding the ‘pure’ 2-point interaction between [MATH: Gi :MATH] and [MATH: Gj :MATH] without the influence of the other genes’ expression. Note that [MATH: G_ :MATH] need not contain the set of all other genes when estimating the interaction. Indeed, it is sufficient to only condition on the Markov blanket (MB) of [MATH: Gi :MATH] and [MATH: Gj :MATH] , i.e., the smallest set of genes [MATH: G_ :MATH] conditional on which [MATH: Gi :MATH] and [MATH: Gj :MATH] are independent of all other genes. Once conditioned on the MB, the information from other genes no longer influences the interaction between [MATH: Gi :MATH] and [MATH: Gj :MATH] . Therefore, restricting [MATH: G_ :MATH] to only contain the MB of genes for each pair [MATH: Gi :MATH] and [MATH: Gj :MATH] , improves statistical power, whilst simultaneously ensures that the 2-point interaction remains stable by measuring the direct dependence between [MATH: Gi :MATH] and [MATH: Gj :MATH] , rather than indirect correlations. The same argument holds for higher-order interactions. Figure [422]EV1 presents a comparison between MFIs, correlation, partial correlation and mututal information, computed on data generated from a set of DAGs in accordance to Appendix Fig. S[423]1, first presented in (Jansma, [424]2023a). The set of MFIs is distinct between distinct DAGs, whereas other dependence metrics are only able to distinguish some, but not all, distinct DAGs. Currently, performing conditional independence tests amongst all groups of genes to determine their MBs, is statistically and computationally prohibitive. For this reason, we restrict the estimation of [MATH: n :MATH] -point interactions to the top [MATH: 1000 :MATH] HVGs, after quality control. Stator then infers the MBs of the HVGs via a hybrid Bayesian network inference technique (Kuipers et al, [425]2022) which sequentially performs (conditional) independence testing, starting from a fully connected undirected graph of genes (Peter–Clark algorithm (Spirtes et al, [426]2001)), followed by a score and search MCMC approach to obtain the optimal completed partially directed acyclic graph (CPDAG), introduced in (Kuipers et al, [427]2022). We emphasise that we do not claim any causal inference or regulatory relationships amongst these genes based on the inferred network. Instead, we utilise this algorithm to infer a gene signature dependence network structure to obtain the MB and estimate higher-order interactions with sufficient statistical power, with the final aim of inferring cell (sub)types and states. Inferring this dependence network massively reduces the search space for potentially significant interactions. For run-time considerations, see Supplementary Material A.1. Finally, we note that MFIs are symmetric. Therefore, when estimating, e.g., a 2-point interaction using line 3 in Eq. ([428]1), one can choose to estimate the terms [MATH: E[Gi∣G j=1,GiMB] :MATH] or [MATH: E[Gj∣G i=1,GjMB] :MATH] , whichever results in the greatest statistical power, i.e., when either the MB of [MATH: Gi :MATH] or [MATH: Gj :MATH] is smaller, or more generally, when the MB of [MATH: Gi :MATH] or [MATH: Gj :MATH] is more populated. Having identified the set of MBs, Stator then estimates up to 7-point interactions amongst the genes in the expression data. The 2-point interactions are estimated between all pairs of genes, the 3-, 4- and 5-point interactions are estimated amongst all gene tuples that are in each other’s MB (the interaction amongst Markov disconnected genes vanishes (Jansma, [429]2023a)), and 6- and 7-point interactions are calculated amongst genes that are in the MB of a tuple of genes with a significant 5- or 6-point interaction. Every interaction is estimated using the smallest possible MB. In order to prioritise candidate interactions for the next step (“Deviating gene tuples (d-tuples)”), each interaction is estimated [MATH: 1000 :MATH] times by bootstrap resampling the data. An interaction is prioritised as a ‘non-zero’ candidate for the next step if the fraction [MATH: λ :MATH] of bootstrap estimated interactions with a different sign from the original estimate is less than [MATH: 0.05 :MATH] . This procedure is more permissive than testing for the hypothesis that the 95% two-sided percentile bootstrap confidence interval does not contain zero. For the datasets studied in this work, we verify numerically that this procedure is equivalent to demanding [MATH: 90−95% :MATH] confidence, depending on the order of the interaction. Deviating gene tuples (d-tuples) In a finite sample of [MATH: N :MATH] cells, the observed frequency [MATH: Φs :MATH] of a tuple [MATH: s={s 1,…,sn} :MATH] of [MATH: n :MATH] independently expressed binarised genes is binomially distributed as: [MATH: P(Φs=k)=Nkπsk(1−πs)N−k,whereπs=∏i=1 nsiμi+(1−si)( 1−μi ), :MATH] 3 and [MATH: μi :MATH] is the mean expression of gene [MATH: i :MATH] across all cells under consideration (i.e., the cells for which the relevant MB is zero). Equation ([430]3) describes the null hypothesis that the observed cell counts are the result of independently expressed genes, and gives the expected number of cells under this null: [MATH: E[Φs ]=πsN :MATH] . An observation [MATH: Φs=ϕs :MATH] of one of the [MATH: 2n :MATH] joint states of [MATH: n :MATH] genes can be assigned a p value: [MATH: p=1−∑k=0 ϕs−< /mo>1P(Φs=k), :MATH] 4 and log twofold change, or deviation: [MATH: Log2FC=log2ϕsπsN∈(−∞,∞ ). :MATH] 5 The p values are calculated for all tuples with a positive Log2FC, and corrected for multiple hypothesis testing with the Benjamini–Yekutieli procedure (Benjamini and Yekutieli, [431]2001). A non-zero interaction can thus have one or more deviating tuples (d-tuples), those tuples of genes that significantly deviate from the null hypothesis. Since a non-zero interaction reflects a higher-order dependency in the data, its d-tuple describes the gene expression patterns that are (at least in part) responsible for this dependency. The set of cells that have the [MATH: n :MATH] genes in that particular expression state—ignoring the state of the MB—form the associated set of cells. Note that cells carrying a certain combination of d-tuples need not cluster in expression space: whilst these cells all share a particular gene expression pattern among the [MATH: n :MATH] genes, the expression of all other genes can vary greatly. This makes it in principle possible for a cell state to be widely dispersed in expression space. For further simulations where fictitious d-tuples are induced or removed from real data, see Supplementary Material, Sec. A.3. Hierarchical clustering of d-tuples Given all d-tuples, Stator creates a cell-by-d-tuple matrix, with binary entries 1 or 0, representing whether or not a cell contains a particular gene d-tuple. Stator then hierarchically clusters these d-tuples (rather than cells) based on a notion of distance, here the Sørensen–Dice coefficient, to identify d-tuples that more commonly co-label the same cells. This hierarchy of separation among d-tuples can be visualised in a dendrogram. Note that the Sørensen–Dice coefficient, sometimes referred to as the Dice similarity coefficient, is not a distance metric because it does not satisfy the triangle inequality. More specifically, the Dice dissimilarity between two boolean vectors [MATH: X :MATH] and [MATH: Y :MATH] is defined as: [MATH: d(X,Y)=1−2∣X∧Y∣∣< mi>X∣+∣Y∣. :MATH] 6 In order to group the d-tuples together, we cut the dendrogram at a Dice dissimilarity that, by default, is set at the value that maximises the weighted modularity score of the resulting clustering (Newman, [432]2006), where a pair of d-tuples is assigned an edge weight of one minus their Dice similarity. At the set Dice dissimilarity threshold, cells expressing these gene d-tuples are grouped together forming Stator states. In particular, cells can exist in multiple multiple Stator states depending on different gene signature similarities. Lowering the Dice value threshold increases granularity, the resolution by which states are predicted, which we have shown, in some instances (e.g., Appendix Fig. S[433]12), to better resolve subtypes or sub-states for large and transcriptionally heterogeneous groups of cells. Stator pipeline Stator is a Nextflow pipeline (written using Nextflow version 21.04) that consists of a main Nextflow script (DSL1) managing a number of Python and R scripts and modules (see Appendix Fig. S[434]3 for an overview of the pipeline). Stator aims to balance modularity and ease-of-use with flexibility, so is fully containerised (Docker images hosted on Dockerhub) and allows the user to specify different preferences and settings in a separate json file, meaning that it