Abstract

Motivation

   The cell cycle is a critical part of cellular life, one that has long
   been studied, both directly, and through its regulatory components.
   Commonly, cell cycle synchronization or selection experiments are
   performed in order to study the cell cycle, thus chemically modifying
   the cells, or selecting them for specific phases. We seek to develop a
   means to study the cell cycle through the use of single cell RNA
   sequencing, effectively circumventing the need for such experiments.

Results

   We utilize a well-established pseudotime method, along with the
   predicted and real expression of genes to calculate the velocity of
   individual genes. We then utilize statistics and expected biological
   behaviour to identify genes with significant shifts in velocity within
   the pseudotime. Additionally, we show the ability to observe gene
   regulatory behaviour such as mRNA splicing and degradation rates. As
   many cell line based research utilize multiple replicates we implement
   a merger method for technical replicates to adjust for technical
   variations, creating a more robust analysis. In summary, our study
   develops a robust approach to map the velocities of individual,
   biologically, and statistically significant genes throughout the cell
   cycle’s phases within a cell line experiment.

Availability and implementation

   Data and code are available at: [32]https://github.com/Ylefol/CC_vel.

1 Introduction

   The cell cycle is a critical component for most living organisms. In
   order to study the cell cycle, cell synchronization is commonly
   utilized ([33]Futcher 1999). Cell synchronization often involves the
   use of a variety of chemicals to synchronize all the cells of a sample
   to a specific phase of the cell cycle. A commonly cited disadvantage of
   this approach is that the environment must first be modified in order
   to synchronize the cells, which in turn may affect the results of
   downstream analyses ([34]Cooper 2019). There are methods to study the
   cell cycle without the utilization of cell synchronization, but these
   typically involve the isolation of cells at specific points of the cell
   cycle. Such methods include fluorescence activated cell sorting, size
   separation, or collecting recently divided cells. Whereas these methods
   do not rely on chemical perturbations of the cells, the selection
   methods can potentially affect gene expression and cellular state
   through mechanical stress ([35]Box et al. 2020, [36]Ryan et al. 2021,
   [37]Wagh et al. 2021). Finally, both synchronization and selection
   methods can provide data at specific states of the cell cycle; however,
   these methods cannot easily be used to study cells through the
   continuous stages of the cell cycle.

   One solution to the study of an entire population of cells within a
   sample is the use of single cell RNA sequencing (scRNA-seq). Early
   scRNA-seq studies have found that cells undergoing cell division can
   form sub-clusters within their respective cell types ([38]Scialdone et
   al. 2015, [39]Chen and Zhou 2017), thus showing that scRNA-seq has the
   ability to identify subsets of dividing cells based on gene expression.
   Furthermore, Scialdone et al. showed that PCA-based classifications
   gave the most accurate phase assignment results, which was also
   confirmed by [40]Kowalczyk et al. (2015) and [41]Tirosh et al. (2016).
   This has been further confirmed by experiments utilizing fluorescent
   ubiquitination cell cycle indicators (FUCCI) to validated cell cycle
   phase assignment ([42]Hsiao et al. 2020). DeepCycle is a machine
   learning based tool created to perform the same task; establishing a
   cell cycle pseudotime and subsequently assign cell cycle phases to
   individual cells ([43]Riba et al. 2022). To then study the dynamics of
   the cell cycle, including ordering cells based on their cell cycle
   progression, the concept of RNA velocity may be used.

   RNA velocity is a term coined to describe changes in a cell’s gene
   expression that can predict the cell’s future state ([44]La Manno et
   al. 2018). The basis of RNA velocity comes from the calculation of the
   relative abundance of precursor messenger RNA (pre-mRNA) as well as
   mature mRNA. Currently, there are two main methods that calculate RNA
   velocity from scRNA-seq data: Velocyto and scVelo. Both are based on
   the same framework of differential equations describing the
   relationship between pre-mRNA and mRNA levels. The primary difference
   between these two methods is their assumptions pertaining to
   transcriptional induction and repression. Velocyto’s steady state model
   assumes that all genes share a common splicing rate and that the
   complete dynamics of gene expression are captured by the sequencing
   data ([45]La Manno et al. 2018). scVelo’s dynamic model assumes that
   these events are dynamic ([46]Bergen et al. 2020). Both methods begin
   by calculating the velocity of individual genes, which is then utilized
   to extrapolate a cell’s fate, including creating a pseudotime ordering
   of the cells.

   Whereas Velocyto and scVelo can provide data-driven characteristics of
   cellular dynamics without any prior information, the known biology of
   the cell cycle, such as the order of cell cycle phases, should be taken
   into account when studying cell cycle dynamics. We therefore propose an
   approach adapted from the method presented by Scialdone et al.; Each
   cell is given a cell cycle score based on their expression of 172
   validated cell cycle genes ([47]Whitfield et al. 2002, [48]Kowalczyk et
   al. 2015, [49]Tirosh et al. 2016), from these scores angle boundaries
   for each cell cycle phase can be drawn within a PCA, cells can then be
   re-assigned to a specific cell cycle phase based on their location in
   the PCA rather than their cell cycle score. By having clear phase
   transition points we are able to establish a pseudotime with a clearly
   defined ‘start’, in this case, the S/G2M transition point. Following
   the establishment of the pseudotime and cell cycle phase boundaries, we
   compute RNA velocity of individual genes, identify genes with
   significant expression changes through the cell cycle, and characterize
   genes based on the pseudotime delays between their pre-mRNA and mRNA
   expression profiles. In this aspect, we utilize both statistics and
   expected biological behaviour of pre-mRNA levels before or at the same
   time as mRNA levels to identify genes with cell cycle dependant
   behaviour. We demonstrate our method on scRNA-seq data from
   unsynchronized human keratinocytes and publicly available data from
   Jurkat and 293T cells and show highly significant overlaps between
   significant cell cycle genes in the three cell lines. Furthermore,
   these genes are highly enriched for known cell cycle functions and for
   genes with increased expression in solid tumours from the cancer genome
   atlas (TCGA) study. Our proposed method is available as a snakemake
   pipeline with an associated tutorial at
   [50]https://github.com/Ylefol/CC_vel.

2 Materials and methods

   Detailed materials and methods can be found in [51]File 1, available as
   [52]supplementary data at Bioinformatics Advances online.

3 Results and discussion

   In scRNA-seq data, genes with known cell cycle-dependent expression can
   be used to cluster cells into groups representing distinct phases of
   the cell cycle ([53]Scialdone et al. 2015, [54]Chen and Zhou 2017).
   This clustering can be readily visualized by plotting cells as a
   function of the first and second principal components (PC1 and PC2) in
   a principal component analysis (PCA) of these genes ([55]Fig. 1A). For
   periodic phase-shifted data, PC1 and PC2 represent a sine, cosine
   decomposition of the data; the angle in the corresponding plane can
   then be used to order the data based on their phase.

Figure 1.

   [56]Figure 1.
   [57]Open in a new tab

   A conceptual figure illustrating the overall workflow contained within
   this manuscript. (A) Depicts a series of three PCA plots showing the
   initial cell cycle phase assignment of individual cells, to the
   filtering process and establishment of cell cycle boundaries, ending
   with the reassignment of cell cycle phases based on the calculated
   boundaries. The last plot also illustrates the established cell cycle
   pseudotime beginning at the S-G2/M boundary due to its consistency
   across cell lines. Blue dots represent G1 cells, green dots are S cells
   and orange represents G2/M cells. (B) Illustrates the calculation of an
   individual genes velocity based on its spliced and unspliced expression
   data. The above plot shows a genes spliced expression and unspliced
   expression where points are cells and lines are means, while the bottom
   plot shows the velocities and confidence intervals (CI) calculated from
   the expression. Genes which have a positive and negative velocity are
   submitted to a t-test to determine their significance. (C) Shows the
   over-representation results when associating significant genes to cell
   cycle phases based on peak spliced expression. The dotplot shows strong
   cell cycle phase related pathways for both S and G2/M while showing
   weak house-keeping pathways for G1. This shows that genes with
   significant shifts in velocity are mostly related to cell cycle
   pathways.

   Here, we present a method that uses the above-mentioned properties to
   analyse the dynamics of gene expression in scRNA-seq data of
   asynchronously dividing cells. Specifically, we (i) use the PCA phase
   angle to establish a cell cycle pseudotime ordering of the cells, (ii)
   use this pseudotime ordering to compute pre-mRNA and mRNA gene
   expression profiles and velocities, (iii) identify genes with
   significant expression changes, and (iv) evaluate their biological
   functions ([58]Fig. 1). Our method is implemented as a snakemake
   pipeline consisting of quality control, establishing cellular
   pseudotime, computing gene velocity, combining technical replicates,
   identifying genes with significant non-zero velocity during the cell
   cycle, computing pseudotime delays between pre-mRNA and mRNA profiles,
   calculating a cell cycle variance threshold, and evaluating functional
   enrichment of significant genes. The following sections describe these
   steps and show the results of using our method on scRNA-seq data from
   three distinct cell lines: data from human keratinocytes (HaCaT)
   generated for this study, and publicly available data from Jurkat and
   293T cells. We also include results of a HeLa dataset as further
   validation.

3.1 Quality control

   The HaCaT cell line is genomically stable and has high differentiating
   and proliferating capacity in vitro ([59]Wilson 2014); we have
   previously used this cell line to study cell cycle-dependent gene
   expression by cell synchronization ([60]Pena-Diaz et al. 2013,
   [61]Hegre et al. 2021). Therefore, to test our proposed method for cell
   cycle analyses of scRNA-seq data, we generated scRNA-seq data from
   asynchronously proliferating HaCaT cells, mixed with mouse embryonic
   fibroblasts at a 9:1 ratio for experimental quality control. Following
   sequencing and data quality control, our dataset consisted of two
   technical replicates with 2399 and 2085 cells and 16 181 and 16 002
   expressed genes, respectively ([62]Table 1, available as
   [63]supplementary data at Bioinformatics Advances online).

3.2 Establishing a pseudotime order

   After selecting genes with known cell cycle-dependent expression
   ([64]Whitfield et al. 2002, [65]Kowalczyk et al. 2015), 172 genes
   remained in each replicate. PCA plots of the resulting data show cells
   clustering based on their individually assigned cell cycle phases in a
   pattern that was very similar between the two technical replicates
   ([66]Fig. 2). Notably, the predicted cell cycle phases were less
   distinct for cells around the G1-S transition point compared to cells
   predicted to be in G2/M. As the number of known genes associated with
   the G1 phase is small compared with those associated with the S and
   G2/M phases [eight compared with 74 and 90 genes, respectively;
   [67]Whitfield et al. (2002)], the G1 predictions can be more
   susceptible to outliers or missing values in the data. Although the
   number of G1 outliers is small, we performed a quick comparison between
   G1 cells within the S cluster with the cells in the G1 cluster, this is
   repeated with the G2M cluster. We isolated differentially expressed
   genes for the G1 cells in each respective cluster ([68]File 2,
   available as [69]supplementary data at Bioinformatics Advances online).
   We then inputted the two list to gprofiler ([70]Raudvere et al. 2019).
   The genes for the G1 cells in the S cluster mapped to cell cycle
   related REAC terms, notably the terms ‘S phase’ and ‘G1/S transition’
   ([71]Table 2, available as [72]supplementary data at Bioinformatics
   Advances online). The genes of the G1 cells in G2M primarily mapped to
   mitotic processes ([73]Table 3, available as [74]supplementary data at
   Bioinformatics Advances online). These results indicate that the
   infiltrating G1 associated cells are indeed where they belong, in the
   case of the S cluster these cells seem to partially mark the transition
   from G1 to S, though are primarily associated to S. Nonetheless we used
   the phase angle distributions of the top-scoring cells assigned to the
   S and G2/M phases to establish phase boundaries in the pseudotime
   ordering. Similarly, due to its more robust identification, we used the
   S-G2/M boundary as pseudotime 0.

Figure 2.

   [75]Figure 2.
   [76]Open in a new tab

   PCA plots represent the process of phase reassignment for both
   replicates. n_counts (counts per cell before normalization) was
   regressed out for both replicates before the reassignment (A, C).
   Following the identification of phase boundaries, cells are reassigned
   to a cell cycle phase (B, D). Each point represents a cell.

   Note that a general assumption made during this study is that the
   majority of the cells do not enter G[0]. This assumption is backed by
   the PCA results ([77]Fig. 2), where we do not distinguish a group of
   cells beyond the three cell cycle phases searched for. G[0] cells would
   be expected to be near the G1 cluster. An argument could be made that a
   small number of G1 associated cells located on the far right-hand side
   of the PCA could be a separate group, though there is no method to
   distinguish between G1 and G[0] cells via gene expression alone.
   Nonetheless, this small group remains inconsequential based on the
   total number of cells found within the dataset.

3.3 Gene expression along a cell cycle pseudotime

   Based on the pseudotime ordering from the PCA analysis, we compute
   normalized pre-mRNA and mRNA expression profiles and velocities,
   illustrated in [78]Fig. 3 for the gene uracil-DNA glycosylase (UNG) and
   DNA Topoisomerase II Alpha (TOP2A). The pre-mRNA velocity serves as a
   ‘prediction’ for the mRNA expression, where positive velocity
   represents an increase in active transcription while negative velocity
   represents a decrease in transcription.

Figure 3.

   [79]Figure 3.
   [80]Open in a new tab

   Expression (top) and velocity (bottom) of the TOP2A (left) and UNG
   (right) genes for the HaCaT cell line with merged replicates. The
   expression data are shown in the upper plot where individual dots show
   the expression of individual cells while the thicker lines show the
   smoothed expression data. The velocities are seen in the lower plots.
   Upper confidence intervals are shown as dashed lines while lower
   confidence intervals are solid lines. Red lines between the upper and
   lower plots show the location of peaks (upper and lower) in spliced
   expression. The red line shows that the expression peaks mark when
   velocity shifts from positive to negative or vice versa.

   The UNG gene plays an important role in DNA damage repair and is
   expected to be highly active in S phase ([81]Slupphaug et al. 1991,
   [82]Hagen et al. 2008, [83]Tirosh et al. 2016). In our results, active
   UNG transcription (positive velocity) starts in mid-G1 phase, peaks in
   S phase, and ends (negative velocities) before the end of S phase
   ([84]Fig. 3). These results are consistent with ([85]Mjelle et al.
   2015), which found that UNG peaks in late G1/early S while also
   declining within S phase in data from synchronized cells. TOP2A plays a
   wide variety of roles, one particular role seems to be the adjustment
   of DNA topology during the mitotic stage of the cell cycle, where the
   absence of TOP2A results in an early exit of Mitosis ([86]Nielsen et
   al. 2020). Along with this, TOP2A, along with TOP2B, is known to play
   an important role in the proliferation of a variety of cancers
   ([87]Uusküla-Reimand and Wilson 2022). Put together, this implicates
   TOP2A as a gene which is highly active in G2M, this is correctly
   represented by our results ([88]Fig. 3).

3.4 Comparison with DeepCycle

   We compare our adapted pseudotime and phase assignment method with
   DeepCycle ([89]Riba et al. 2022). DeepCycle identified all three phase
   transitions (M/G1, G1/S, S/G2) in HaCaT A ([90]Fig. 1, available as
   [91]supplementary data at Bioinformatics Advances online) but failed to
   detect the M/G1 transition in HaCaT B. DeepCycle determines M/G1 by
   identifying the first transcriptional phase bin where RNA counts (UMI)
   per cell drop, a method designed for cell populations rather than cell
   lines, which may explain its failure in HaCaT B. For HaCaT A, where the
   transition was detected, DeepCycle’s phase assignments differ markedly
   from ours ([92]Figs 2 and [93]3, available as [94]supplementary data at
   Bioinformatics Advances online). It assigns TOP2A’s peak to G1 instead
   of G2M ([95]Nielsen et al. 2020, [96]Uusküla-Reimand and Wilson 2022)
   and UNG’s peak correctly to late G1/early S but inaccurately shows its
   expression increasing in late S instead of G1 ([97]Mjelle et al. 2015).
   These results suggest that while DeepCycle generates a pseudotime, it
   fails at accurate phase assignment and only works for one of our two
   HaCaT replicates. Thus, we do not use DeepCycle moving forward.

3.5 Merging technical replicates

   When comparing the technical replicates in the HaCaT data, we noted
   some variation between spliced and unspliced velocities, as seen in the
   velocities for the UNG and TOP2A genes ([98]Fig. 4, available as
   [99]supplementary data at Bioinformatics Advances online). To
   accommodate our analyses to variation between replicates, we adapted
   the method in order to use replicates to determine which areas of a
   gene’s velocity is consistent between replicates thus increasing the
   robustness of the method. This is illustrated by 99% CI for each gene’s
   velocity curves ([100]Fig. 5, available as [101]supplementary data at
   Bioinformatics Advances online).

3.6 Determining gene significance and delay

   Genes with cell cycle-dependent gene expression should have both
   increasing and decreasing expression throughout the cell cycle
   pseudotime. To identify genes of interest, we, therefore, start by
   filtering for genes which exhibit both positive and negative velocity,
   which in the HaCaT data, resulted in 5963 genes which we term
   ‘rank-able’. We then used a Student’s t-test on the gene’s peak mRNA
   velocity to identify significant genes, which, after correcting for
   multiple testing, identified 4961 significant genes (adjusted
   [MATH: <mrow><mi>P</mi><mo><</mo><mn>0.01</mn></mrow> :MATH]
   ).

   To further characterize the expression dynamics of these genes, we
   measured the pseudotime delay for the rank-able and significant genes
   ([102]Fig. 4). Delays are defined as the pseudotime difference between
   mRNA and pre-mRNA velocities, where a positive delay indicates that a
   change in pre-mRNA velocity occurred before the same change in mRNA
   velocity. Negative delay values indicate the opposite.

Figure 4.

   [103]Figure 4.
   [104]Open in a new tab

   Raincloud plot showing two delay categories using two different sets of
   genes. Orange shows the ‘increase to 1’ and blue shows ‘decrease to 0’.
   Significant genes are genes with an adjusted P value (from a t-test)
   below 0.01. These plots show boxplots, violin plots, and jitter plot of
   the log10 transformed delay values for each category.

   To be consistent with a model where changes in pre-mRNA levels are
   followed by changes in mRNA levels, gene-level delays should be zero or
   positive. This was the case for most of the rank-able and significant
   genes in our results when considering the delay between maximum mRNA
   and pre-mRNA expression [‘decrease to 0’ (blue) in [105]Fig. 4], where
   both curves are likely to be most accurate. In contrast, the category
   illustrating the beginning of active transcription to peak velocity
   [increase to +1 (orange)] shows that most genes had negative delays
   between mature mRNA and pre-mRNA expression indicating that the
   prediction potential of pre-mRNA in regards to mRNA is stronger when
   expression heads towards, and reaches, its peak.

   Based on the assumption that genes with cell-cycle-dependent expression
   should all have zero or positive delays, we sought to identify the
   genes that had significant negative delays, thereby violating the
   transcriptional model’s assumptions. To identify the genes with
   significant negative delays, we modelled the empirical delay
   distribution as a Gaussian mixture model to split the trimodal delay
   distribution into three normally distributed curves, we then determined
   the 95th percentile of the normal distribution with a mean value of, or
   closest to 0, and used that value as a negative delay threshold
   ([106]Fig. 5, available as [107]supplementary data at Bioinformatics
   Advances online).

   Using this method we identified several significant negative delay
   genes, specifically 170 for HaCaT, 1000 for 293T, and 590 for Jurkat
   ([108]File 1, available as [109]supplementary data at Bioinformatics
   Advances online). In order to investigate if these genes were related
   to a biological condition that our pseudotime did not accommodate, we
   performed an over-representation analysis with each gene list
   ([110]Tables 4–6, available as [111]supplementary data at
   Bioinformatics Advances online). These analyses revealed no consistency
   between cell lines thus indicating that the likely cause of these
   significant negative delays is due to technical artefacts. We therefore
   validate the need to filter out these genes. Although we utilize these
   delays to filter out technical artefacts, we hypothesize that they may
   be used in a wider application. They could also provide insights into a
   gene’s post-transcriptional regulatory mechanisms via its rate of
   pre-mRNA splicing and mRNA degradation. Specifically, an increase in
   the rate of mRNA splicing would create an increased distance between
   peak pre-mRNA and peak mRNA expression, whereas an increase in the rate
   of mRNA degradation would decrease the same distance. Consequently, the
   genes with small delays in peak expression may have distinct
   post-transcriptional regulatory mechanisms from those with longer
   delays.

3.7 Comparing significant genes between cell lines

   To investigate the extent with which our methods and results
   generalized into other data, we ran our analysis pipeline on publicly
   available scRNA-seq data from two other cell lines: the human T
   lymphocyte cell line Jurkat and the human embryonic kidney cell line
   293T. Quality control ([112]Tables 7 and [113]8, available as
   [114]supplementary data at Bioinformatics Advances online) and
   pseudotime assignment ([115]Figs 6 and [116]7, available as
   [117]supplementary data at Bioinformatics Advances online) showed that
   the four technical replicates for each cell line were comparable but
   slightly more variable than those from our HaCaT data. We identified
   4334 and 4369 significant genes in the 293T and Jurkat data and these
   genes showed similar patterns in delays as the significant HaCaT genes
   ([118]Fig. 8, available as [119]supplementary data at Bioinformatics
   Advances online, [120]File 1, available as [121]supplementary data at
   Bioinformatics Advances online). With an odds ratio greater than one
   representing a higher likelihood of overlap of significant genes, we
   observe that general gene overlap, irrespective of cell cycle phase,
   shows a lower odd of obtaining significant gene overlap ([122]Table 9,
   available as [123]supplementary data at Bioinformatics Advances
   online). To further evaluate these results, we investigated to what
   extent genes with specific phase assignments overlapped between the
   cell lines. We used three possible ways of assigning genes to the three
   main pseudotime phases (G1, S, and G2/M), based on the locations of
   peak velocity, peak expression, and start of transcription (i.e.
   velocity > 0). Though variation is observed between the methods, the
   general trend is a odds ratio greater than one along with a significant
   P value for genes associated with the S and G2/M phases. Overall, phase
   assignment based on expression appeared to be the better method, likely
   due to the use of gene expression for the establishment of the cell
   cycle pseudotime, as opposed to gene velocity.

   The altering direction of the odds ratio when comparing the general
   overlap with cell cycle phase overlap points towards some variability
   within the data. To accommodate this, we implement the use of a delay
   threshold whose identification is described in the previous section. We
   again compute the gene and phase overlap, but this time limiting the
   phase overlap to the peak expression category ([124]Table 10, available
   as [125]supplementary data at Bioinformatics Advances online). These
   results show some improvement of significance in the phase assignment,
   however there are still indications of strong variability within the
   data. This is further explored in the section titled: Filtering for
   house-keeping genes.

3.8 Applying the Velocyto model

   For comparisons, we applied the Velocyto model to the scRNA-seq data
   from the three cell lines ([126]Fig. 5; [127]Figs 6 and [128]7,
   available as [129]supplementary data at Bioinformatics Advances
   online). We obtained velocity fields which clearly show the different
   cell cycle phases along with their predicted next phase. Velocyto was
   able to accurately predict the transition from S phase to G2/M phase
   and the transition from G2/M to G1. The G1 to S transition is not
   predicted by Velocyto as it is not exclusively dependant on gene
   expression. It is instead thought to largely depend on critical cell
   size ([130]Barberis et al. 2007) and therefore cannot be visualized
   using this method. However, we note that the overall trend, represented
   by the continuous black line in [131]Fig. 5, shows a trajectory from G1
   to G2/M. Velocyto also predicted the same overall trend in all four
   Jurkat and 293T replicates ([132]Figs 6 and [133]7, available as
   [134]supplementary data at Bioinformatics Advances online).
   Additionally, we check the results obtained from scVelo ([135]Fig. 11,
   available as [136]supplementary data at Bioinformatics Advances
   online), which show differing results when compared to Velocyto, where
   the velocity stream and latent time end somewhere between the S and G2M
   clusters. The divergence between Velocyto and scVelo show that neither
   is ideal for the creation of a cell cycle pseudotime, however the
   Velocyto results show a behaviour than can be explained through biology
   while the same cannot be said with scVelo where the terminal state is
   situated roughly between S and G2M.

Figure 5.

   Figure 5.
   [137]Open in a new tab

   The velocity fields for both technical replicates, the field represents
   the different phases of the cell cycle and the expected differentiation
   path based on gene expression. The principal curve shows the overall
   directionality of the data points. Each point in the velocity field
   represents an individual cell.

3.9 Pathway enrichment for pseudotime validation

   To biologically validate our findings, we performed an
   over-representation analysis using gprofiler ([138]Raudvere et al.
   2019). Considering three sets of genes assigned to the three cell cycle
   phases, we expect the sets to be enriched for functions related to
   their respective phases; i.e. we expect S phase genes to be enriched
   for functions related to DNA replication and DNA repair. Indeed, when
   analysing the significant HaCaT genes phase assigned based on peak
   expression, both S and G2/M phase genes were enriched for expected
   biological pathways ([139]Fig. 6).

Figure 6.

   [140]Figure 6.
   [141]Open in a new tab

   A dotplot showing REACTOME terms found for the HaCaT cell line, with a
   maximum of 10 terms per category. Top terms are selected based on false
   discovery rate (FDR) and the odds ratio (OR). The x axis shows the cell
   cycle phases. Term names are shown on the y axis. The size of each dot
   is proportional to the OR of the associated term on the y axis. The
   colour scale is created using a -log10 transformation of the FDR which
   was calculated using the Benjamini-Hochberg approach.

   For S phase, the most significant terms were related to DNA repair,
   which is consistent with DNA repair having a critical role during DNA
   replication ([142]Hustedt and Durocher 2016). For G2/M, we primarily
   observed mitosis-related terms. Genes in G1 were primarily defined by
   translational initiation and preparation of ribosomal subunits.
   Although G1 phase is defined by cellular growth driven by protein
   production, these pathways are broad in nature and may be associated
   with general house-keeping functions rather than G1 specifically (see
   the next section). Running the same over-representation analysis on the
   Jurkat, 293T and HeLa results, gave strikingly similar results across
   the different cell lines ([143]Fig. 12, available as [144]supplementary
   data at Bioinformatics Advances online). Noticeably, in both the 293T,
   Jurkat and HeLa cell lines, S phase associated pathways were also found
   within G1, but at a reduced significance. This indicates that the
   assigned G1-S boundary in these two datasets may be slightly shifted in
   favour of G1. We initially suspected that the number of cell cycle
   genes found would be the cause of this, however, the differences
   between cell lines are small ([145]Table 11, available as
   [146]supplementary data at Bioinformatics Advances online). An
   additional possibility is that there are alternate sources of
   variability present within the data thus affecting the phase assignment
   as indicated by both the PCAs ([147]Figs 6 and [148]7, available as
   [149]supplementary data at Bioinformatics Advances online) and the
   velocity fields ([150]Figs 9 and [151]10, available as
   [152]supplementary data at Bioinformatics Advances online). An
   over-representation analysis of the genes which did not pass the t-test
   filter ([153]Fig. 13, available as [154]supplementary data at
   Bioinformatics Advances online) found no significant REACTOME
   cell-cycle terms, suggesting that the t-test filter retained the
   cell-cycle related genes.

3.10 Verifying the G1 phase results

   To investigate the broad REACTOME pathways found to be associated with
   the G1 phase, we plotted the gene trajectories of selected REACTOME
   pathways associated with the G1 or G2/M phases ([155]Fig. 7). Whereas
   both the genes associated with G1 and G2/M phase pathways mainly peaked
   in their respective phases, the genes in the G2/M phase pathways
   generally had much higher peak expression than the genes in the G1
   pathways. Specifically, the expression trajectories for the G1 phase
   pathway genes seemed to be consistent with constant gene activity and a
   very mild increase in G1. Such a low increase could potentially be
   explained by large fluctuations in cell cycle genes expressed in S and
   G2/M phases causing small but consistent inverse fluctuations in the
   relative expression profiles of house-keeping genes rather than cell
   cycle-dependent expression.

Figure 7.

   [156]Figure 7.
   [157]Open in a new tab

   Line plots for the HaCaT cell line showing spliced and unspliced
   expression trajectories for genes from selected REACTOME pathways.
   Plots A and B highlight pathways peaking in G2/M—Mitotic Prometaphase
   and Resolution of Sister Chromatid Cohesion—while Plots C and D focus
   on G1 peaking pathways—Selenocysteine Synthesis and Eukaryotic
   Translation Initiation. Only genes with an adjusted p value < 0.01 are
   included, and each pathway’s mean spliced and unspliced trend is
   overlaid as a bold, dark line. Values are scaled, and plot titles list
   both the number of genes shown and the total genes in each REACTOME
   pathway.

3.11 Filtering for house-keeping genes

   To identify and remove genes with house-keeping behaviour, we
   implemented a gene filtering method based on gene expression throughout
   the cell cycle phases, removing genes having a cell cycle variability
   below a certain threshold. To determine a basis for the threshold, we
   calculated the mean variability of the genes in the two G2/M and G1
   REACTOME pathways in [158]Fig. 7. The variability was calculated based
   on the log10 transformed gene expression ([159]Table 12, available as
   [160]supplementary data at Bioinformatics Advances online) in order to
   account for the heteroskedasticity of count data as well as different
   genes having different average expression levels potentially due to
   their varying functions.

   Based on the expression variation for the putative housekeeping genes
   compared with the G2/M phase genes, we decided to filter out genes
   based on the log10 cell cycle variance, where the median log10 variance
   for each pathway is calculated, and the mean of the median is selected
   as the variance threshold. Hence this creates a variance threshold that
   will be specific for each cell line. Following this filter (in addition
   to the t-test and delay filters), 1140, 2213, and 1983 significant
   genes remained for HaCaT, 293T, and Jurkat respectively. Subsequent
   pathway enrichment analysis of the remaining genes showed a marked
   decrease in the significance of the remaining G1 genes ([161]Fig. 14,
   available as [162]supplementary data at Bioinformatics Advances
   online). All G1 pathways were no longer considered significant and
   therefore do not appear in the plot. Furthermore, the G1 genes passing
   the variation filter showed marked increased expression during the G1
   phase ([163]Fig. 15, available as [164]supplementary data at
   Bioinformatics Advances online). We found similar results in the other
   three cell lines ([165]Fig. 16, available as [166]supplementary data at
   Bioinformatics Advances online), with the exception that certain G1
   pathways remain, however at a reduced significance. We also perform a
   chi-square analysis for gene overlap, with this variance filter, in
   addition to the previous measures of identifying significant genes
   (t-test and negative delay threshold) ([167]Table 13, available as
   [168]supplementary data at Bioinformatics Advances online) In this
   table we observe several significant odd ratios above one, indicating a
   higher likelihood of overlap, these odds are increased from the
   previous gene overlap analysis which used the t-test and delay
   thresholding alone ([169]Table 10, available as [170]supplementary data
   at Bioinformatics Advances online). Put together, these findings
   confirmed that filtering genes with low variation (below the mean of
   the median) effectively removed genes whose apparent periodic
   expression could be experimental artefacts.

3.12 Cross-reference significant cell cycle genes with TCGA

   Genes with cell-cycle dependent expression tend to be upregulated in
   dividing cells. To further validate the significant genes found using
   the t-test, delay threshold, and cell cycle variance threshold, we
   sought to compare them with upregulated genes within The Cancer Genome
   Atlas (TCGA). We, therefore, identified genes (826) up-regulated in all
   registered cancers compared with the corresponding normal tissues in
   the TCGA database ([171]Giordano 2014). Our assumption was that these
   genes will primarily be cell cycle genes upregulated in cancers in
   general and should therefore be found in our list of identified
   significant genes.

   When comparing the TCGA genes with the filtered genes from our cell
   cycle analyses, all three cell lines had significant overlap with the
   TCGA genes ([172]Tables S14 and [173]15, available as
   [174]supplementary data at Bioinformatics Advances online). The overall
   cell cycle phase trends were also similar, with G1 genes having similar
   to or significantly less overlap than expected and S and G2/M having
   significantly more overlapping genes than expected. This trend was even
   stronger when only considering common significant genes between the
   three cell lines. These results indicate that genes found to be
   significant in both G2/M and S phase are relevant for cancer
   proliferation, whereas most or all genes found to be associated with
   the G1 phase are not.

4 Conclusion

   We have shown that our scRNA-seq protocol creates reproducible results
   by comparing all results with its technical replicate counter part. We
   also demonstrated the protocol’s ability to sort cells along a cell
   cycle based pseudotime order while using the S-G2/M transition point as
   the start of the pseudotime. Additionally, we validated the method that
   calculates and uses gene velocity to identify genes of interest. These
   genes are considered to be significant by a t-test based on the
   presence of positive and negative gene velocity. The list of relevant
   genes is further filtered via the removal of genes with significant
   negative cell cycle delays and house-keeping behaviour. Using our
   filtered significant genes we found significant overlap with genes
   up-regulated in cancers from the TCGA database.

   Our method for creating a pseudotime ordering operates on the
   assumption that a gene list of 172 genes is representative of the cell
   cycle and that the expression patterns of these genes are mainly
   related to the cell cycle. Although this may be a limitation in certain
   cases, it may double as an advantage. This assumption allows this
   method to be easily transferable to various biological pathways
   provided that they contain an adequate number of representative genes.

   A second limitation of this study is the case of variability within
   scRNA-seq data. We have demonstrated that variability within technical
   replicates exists and that we can adjust for this variability by
   merging technical replicates. However, not all studies may have
   technical replicates available. This aspect should be considered when
   employing this method with single replicate studies.

   Overall, we show an efficient way of studying the cell cycle and its
   various elements all while circumventing the need to perform cell cycle
   synchronization or selection experiments.

Supplementary Material

   vbaf123_Supplementary_Data
   [175]vbaf123_supplementary_data.zip^ (22.3MB, zip)

Acknowledgements