Graphical abstract

   graphic file with name fx1.jpg
   [34]Open in a new tab

Highlights

     * •
       Highly variable genes (HVGs) as an alternative to DEGs for
       time-series scRNA-seq data
     * •
       Visualizing dynamic expression patterns of HVGs over multiple cell
       types in one figure
     * •
       Exploring biological pathways with common and cell-type-specific
       expression dynamics
     __________________________________________________________________

   Publisher’s note: Undertaking any experimental protocol requires
   adherence to local institutional guidelines for laboratory safety and
   ethics.
     __________________________________________________________________

   Here, we present a computational approach for investigating highly
   variable genes (HVGs) associated with biological pathways of interest,
   across multiple time points and cell types in single-cell
   RNA-sequencing (scRNA-seq) data. Using public dengue virus and COVID-19
   datasets, we describe steps for using the framework to characterize the
   dynamic expression levels of HVGs related to common and
   cell-type-specific biological pathways over multiple immune cell types.

Before you begin

Overview

   Investigation of differentially expressed genes (DEGs) between two
   biological conditions or paired samples (e.g., control vs.’ treatment,
   or healthy vs.’ disease samples) is one of the key steps in genome-wide
   gene expression analyses, such as transcriptomic and proteomic studies.
   With single-cell RNA-sequencing (scRNA-seq) technologies, we are now
   able to look into transcriptomic profiles of individual cells, and
   hence perform DEG analyses on different cell types and suppopulations.
   Alternatively, one can also characterize a set of genes that are
   differentially expressed between cell types, frequently referred to as
   marker genes. However, most current computational methods were intended
   for the comparison between two conditions of interest and/or on
   specific cell type at the
   time,[35]^2^,[36]^3^,[37]^4^,[38]^5^,[39]^6^,[40]^7^,[41]^8^,[42]^9 but
   not for investigating gene expression dynamics across more than two
   conditions, such as in a time-course dataset.

   Here, we present a framework for identifying and investigating highly
   variable genes (HVGs) that demonstrate highly dynamic expression
   patterns across several time points (or biological conditions). Using
   our workflow, one can visualize relative expression levels of HVGs
   associated with biological pathways of interest in multiple time points
   and across different cell types in one analysis. Using publicly
   available time-course scRNA-seq datasets of dengue[43]^1 and
   SARS-CoV-2[44]^10 infections, we have demonstrated how the protocol can
   be used to extract HVGs, their biological processes, and gene
   expression patterns that are unique to certain cell types, or shared
   across multiple populations of cells.

Install tools/packages

     * 1.
       Installation of R ([45]https://www.r-project.org/), and optionally,
       RStudio ([46]https://www.rstudio.com/).
     * 2.
       Installation of the relevant R packages.
          + a.
            All the R packages required from the analyses are listed under
            the Software and Algorithms section of the [47]key resources
            table.
          + b.
            Users can also download the Docker Hub, URL:
            [48]https://hub.docker.com/r/jantarika/rstudio_denguetimecours
            e, that contains all the R packages used in this protocol.

Download scRNA-seq datasets

     * 3.
       Download the datasets, which was used as an example in this
       protocol.
          + a.
            Download the datasets mentioned in the Deposited Data section
            of the [49]key resources table.
          + b.
            Alternatively, the complete datasets used for this analysis
            are also deposited to Mendeley Data:
            [50]https://data.mendeley.com/datasets/6ry3x7r8hf/3.

Institutional permission

   The data described in this article were originally published by Arora
   and Opasawatchai and colleagues,[51]^1 which was approved by the
   Institutional Review Boards of Faculty of Medicine Vajira Hospital
   (No.015/12), Faculty of Tropical Medicine Mahidol University (TMEC
   13041) and Faculty of Medicine, Ramathibodi Hospital, Mahidol
   University (MURA2019/603).

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Deposited data
     __________________________________________________________________

   Dataset: Raw sequencing data of 4 time-points from DF and DHF patients
   and a healthy donor Arora et al.[52]^1 ArrayExpress: E-MTAB-9467
   Dataset: 4k PBMCs from a healthy donor 10x Genomics Single Cell Gene
   Expression Datasets
   [53]https://support.10xgenomics.com/single-cell-gene-expression/dataset
   s/2.1.0/pbmc4k
   Complete datasets used for this protocol Mendeley Data
   [54]https://data.mendeley.com/datasets/6ry3x7r8hf/3
   Dataset: Processed scRNA-seq datasets of COVID-19 PBMC samples
   Schulte-Schrepping et al.[55]^10 [56]http://fastgenomics.org
   Algorithms and computer codes, and the version record Arora
   et al.[57]^1 [58]https://github.com/vclabsysbio/scRNAseq_DVtimecourse
   [59]https://doi.org/10.5281/zenodo.7968936
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   RStudio 4.0.2 RStudio [60]https://www.rstudio.com/
   Seurat v3.1.2 Stuart et al.[61]^2 [62]https://satijalab.org/seurat/
   SoupX v1.0.1 Young and Behjati.[63]^11
   [64]https://github.com/constantAmateur/SoupX
   DoubletFinder v2.0.3 McGinnis et al.[65]^12
   [66]https://github.com/chris-mcginnis-ucsf/DoubletFinder
   gProfier2 v0.1.8 Raudvere et al.[67]^13
   [68]https://biit.cs.ut.ee/gprofiler/
   ggplot2 v3.3.2 Wickham.[69]^14 [70]https://ggplot2.tidyverse.org/
   tidyverse v.2.0.0 Wickham et al.[71]^15
   [72]https://www.tidyverse.org/packages/
   [73]Open in a new tab

Materials and equipment

Bioinformatic analyses

   All the bioinformatic analyses presented here were tested on an
   in-house server (Intel Core Processor (Broadwell, IBRS), 39 CPUs, and
   480 GB RAM.), running on Ubuntu 16.04.6 LTS.

Step-by-step method details

   Before identification and investigation of time-course HVGs and their
   dynamic expression profiles, the scRNA-seq data must be pre-processed
   and quality controlled, which are summarized briefly here, and are
   described in details through GitHub
   ([74]https://github.com/vclabsysbio/scRNAseq_DVtimecourse). Note that
   these steps have also been comprehensively described
   elsewhere.[75]^16^,[76]^17^,[77]^18

QC and filtering single-cell RNA-seq data

     Inline graphic Timing: ∼ 1–2 h/sample (for steps 1 to 3)

   The quality control and filtering scRNA-seq data were performed
   separately on individual samples, before data integration and cell type
   annotation ([78]Figure 1). The QC stage comprises correcting of ambient
   RNAs, exclusion of dead and low quality cells and potential doublets,
   in prior to the downstream bioinformatic analyses. Raw and filtered
   expression matrices generated by the cellranger count function are
   required as the inputs of this step - see [79]key resources table.

     Note:[80]Figure 1 illustrates the overview of the bioinformatic
     pipeline described in this protocol.

     Note: R packages and codes required for the QC and filtering step
     are provided in Github URL:
     [81]https://github.com/vclabsysbio/scRNAseq_DVtimecourse. Complete
     datasets used for this step are deposited in Mendeley Data:
     [82]https://data.mendeley.com/datasets/6ry3x7r8hf/3.

     * 1.
       Correct ambient RNAs using SoupX.[83]^11

     Note: Check for the expression levels of ambient RNAs in your
     datasets. If an excessive amount is observed, users can apply
     several ambient RNA removal tools such as SoupX[84]^11 to correct
     their expression levels. For more details of predicting and
     correcting the expression values of ambient RNAs, please refer to
     the SoupX tutorial.[85]^11

     * 2.
       Exclude the cells expressing excessive mitochondrial genes, in this
       example, more than 10% in the total transcript counts.

     Note: Check the distribution of percent mitochondrial genes
     (percent.mt) to determine suitable cut-offs in your scRNA-seq
     datasets. For details, please refer to the Seurat toolkit.[86]^2

     * 3.
       Predict and discard doublets using doubletFinder.[87]^12

     Note: For droplet-based scRNA-seq, we recommend excluding “doublets”
     and “multiplets”. For more details about the optimal cut-offs in
     each parameter, please refer to doubletFinder.[88]^12

Figure 1.

   [89]Figure 1
   [90]Open in a new tab

   Overview of bioinformatic pipeline of scRNA-seq data analysis described
   in this protocol

   HVGs, highly variable genes.

Data normalization, integration, and cell type annotation

     Inline graphic Timing: ∼ 2 h (for steps 4 to 7)

   After filtering out low quality cells and correcting the expression
   values of ambient RNAs, the data are used as inputs for normalization,
   integration, and cell type annotation. Seurat objects of individual
   samples after QC and filtering step are used as the inputs of this
   step.
     * 4.
       Run the standard preprocessing workflow for each individual sample.
          + a.
            Import and create a list of individual Seurat objects after
            the QC and filtering step.
          + b.
            Run the SCTransform function, a regularized negative binomial
            regression.[91]^19
          + c.
            Apply the first 30 principal components (PCs) for cell
            clustering.
          + d.
            Identify cell clusters by the shared nearest neighbor (SNN)
            method using the Louvain algorithm with multilevel refinement.
     * 5.
       Integrate all the samples. We used the following settings in our
       examples here: three thousand (3,000) features, a Louvain algorithm
       with multi-level refinement for clustering and the resolution of
       three. Other parameters are as default.
     * 6.
       Normalize the transcript counts of each cell using the
       NormalizeData() function.
     * 7.
       Identify major immune cell types using known canonical marker genes
       (as described in [92]Table S1).

     Note: For cell type annotation, in our examples we first labeled
     different clusters of T cells with the same name in order to group
     them together as a single “T cells” cluster, which was subsequently
     re-clustered into subpopulations.

     Note: High heterogeneity of PBMCs, especially at the subpopulation
     levels, might be difficult to identify using the known marker genes
     ([93]Table S1). Alternatively, reference-based for cell type
     annotation such as SingleR,[94]^20 Azimuth,[95]^21 and ScType[96]^22
     can be applied.

Highly variable gene (HVG) identification and pathway enrichment analysis
across multiple time points – Dengue case study

     Inline graphic Timing: ∼10–15 min (for steps 8 to 14)

   Here, we describe the process for obtaining HVGs, which represent the
   variations in transcriptional levels across several biological
   conditions (typically >2 conditions, in our case, four time points from
   two DENV-infected patients and two healthy controls) in multiple immune
   cell types of interest. The inputs of this step are the integrated
   scRNA-seq Seurat objects from the previous pro-processing and cell type
   annotation described above ([97]Figure 1).

     Note: We first performed a pseudo-bulk expression analysis by
     calculating the average transcription levels of all the genes. Then,
     principal component analysis (PCA) of the averaged gene expression
     in each cell type was performed ([98]Figure 2A). This PCA of each
     cell type was used to extract the genes that demonstrate the largest
     variation across biological conditions, – to be referred to as
     “Highly Variable Genes (HVGs)”. In this example, the most apparent
     differences are between the four time points, whereas those between
     the two DENV patients are relatively small. Finally, we investigated
     the biological pathways associated with the HVGs that show common
     and unique expression dynamics among different cell types and
     biological conditions ([99]Figure 2B). The workflow is described
     step-by-step below.

     Note: Processed scRNA-seq dataset used in this step is deposited in
     Mendeley Data: [100]https://data.mendeley.com/datasets/6ry3x7r8hf/3

Figure 2.

   [101]Figure 2
   [102]Open in a new tab

   The relative expression levels of highly variable genes (HVGs) in each
   biological process (BP) of interest, across the four time points and
   major immune cell types, from two dengue-infected patients and two
   healthy controls

   (A and B) Figures were modified from Arora et al., 2022[103]^1 (A). PCA
   of average gene expression in each major immune cell type of interest.
   (B). Dotplot representing relative expression levels of HVGs. The BPs
   of the HVGs across the four major immune cell types of interest here
   are highlighted in red texts, monocyte-specific BPs are in blue, and B
   cell-specific BPs are in green.
     * 8.
       Subset the integrated Seurat object according to their major cell
       types into monocytes, NK cells, T cells, and B cells/plasma
       cells/plasmablasts - the main immune cell types of interest in this
       study.

   # Load libraries

   library(Seurat)

   library(gprofiler2)

   library(ggplot2)

   # Set your working directory, pointing to the folder where all your
   input and output files will be saved

   setwd("PATH/TO/YOUR/WORKING/DIRECTORY")

   # Load integrated data

   sc_integrated <- readRDS(file =
   "PATH/TO/YOUR/WORKING/DIRECTORY/sc_integrated.rds")

   # Subset each cell type

   Idents(sc_integrated) <- "Cell_Types"

   each_celltype_list <- list()

   each_celltype <- c("Monocytes" , "NK cells" , "T cells" , "B cells")

   for (RN in each_celltype) {

    each_celltype_list[[RN]] <- subset(sc_integrated , idents = RN)

   }

   names(each_celltype_list) <- each_celltype
     * 9.
       Calculate the average gene expression levels in each cell type
       across biological conditions of interest.
          + a.
            Set the object’s identity class based on condition of interest
            using the Idents() function.
          + b.
            Calculate the average gene expression using the
            AverageExpression() function.

   # Calculate average gene expression in each cell type across severity
   and time

   Avg_expression_list <- list()

   for (RN in 1:length(each_celltype_list)) {

    Idents(each_celltype_list[[RN]]) <- "ST"

    Avg_expression_list[[RN]] <-
   AverageExpression(each_celltype_list[[RN]] , assays = "RNA" , slot =
   "data")

    Avg_expression_list[[RN]] <-
   as.data.frame(Avg_expression_list[[RN]]$RNA)

   }

   names(Avg_expression_list) <- names(each_celltype_list)
     * 10.
       Construct the principal components (PCs) of the average gene
       expression levels in each cell type using the prcomp() function in
       R ([104]Figure 2A).

   # Construct Principal Components (PCs) in each cell type

   pca_out <- list()

   pca_perc <- list()

   df_pca <- list()

   for (RN in 1:length(Avg_expression_list)) {

    pca_out[[RN]] <- prcomp(t(Avg_expression_list[[RN]]))

    pca_perc[[RN]] <-
   round(100∗pca_out[[RN]]$sdevˆ2/sum(pca_out[[RN]]$sdevˆ2),1)

    df_pca[[RN]] <- data.frame(PC1 = pca_out[[RN]]$x[,1], PC2 =
   pca_out[[RN]]$x[,2], sample = colnames(Avg_expression_list[[RN]]))

    # Add metadata can be differences in each dataset

    df_pca[[RN]]$Severity <- c(rep("DF" , 4) , rep("DHF" , 4) , "Healthy"
   , "Healthy")

    df_pca[[RN]]$Time <- c("Day-2" , "Day-1" , "Def" , "Wk2" , "Day-2" ,
   "Day-1" , "Def" , "Wk2" ,"Healthy I" , "Healthy II")

    df_pca[[RN]]$Time <- factor(df_pca[[RN]]$Time , levels = c( "Day-2" ,
   "Day-1" , "Def" , "Wk2" , "Healthy I" , "Healthy II"))

   }

   names(pca_out) <- names(Avg_expression_list)

   names(pca_perc) <- names(Avg_expression_list)

   names(df_pca) <- names(Avg_expression_list)

     Note: Please add metadata based on your datasets.

     * 11.
       Visualize the PCA results using the ggplot2 package.[105]^14

   # Visualize PCA results

   for (RN in 1:length(df_pca)) {

    pca_plot<- ggplot(df_pca[[RN]], aes(PC1,PC2, color = Time))+
   geom_point(aes(shape = Severity ), size=6 , stroke = 1.4)+
   labs(x=paste0("PC1 (",pca_perc[[RN]][1],"%)"), y=paste0("PC2
   (",pca_perc[[RN]][2],"%)")) +
   scale_color_manual(values=c("darkgoldenrod2", "#ff7400","#ff1218",
   "#47849c" , "darkgrey" , "gray6")) + theme(axis.text =
   element_text(size = 17 , face="bold" , colour = "black") ,
   axis.title.y = element_text(color="black", size=15, face="bold") ,
   axis.title.x = element_text(color="black", size=17, face="bold") ,
   legend.title = element_text(face = "bold" , size = 17) , legend.text =
   element_text(size = 16) , legend.key.size = unit(1, "cm") ,
   legend.key.width = unit(0.5,"cm") , legend.key = element_rect(fill =
   "white") ) + ggtitle(names(df_pca[RN]))

   plot(pca_plot)

   }

     Note: Color and shape can be manually adjusted based on your data.

     * 12.
       Union the top 500 genes from the first and second PCs of each
       immune cell type – referred to as “HVGs” herein.

   # Union top 500 genes from PC1 and PC2 from all cell types

   HVGs_each_celltype <- list()

   for (RN in 1:length(pca_out)) {

    HVGs_each_celltype[[RN]] <-
   union(rownames(data.frame(sort(abs(pca_out[[RN]]$rotation[,"PC1"]),
   decreasing=TRUE)[1:500])) ,
   rownames(data.frame(sort(abs(pca_out[[RN]]$rotation[,"PC2"]),
   decreasing=TRUE)[1:500])))

   }

   names(HVGs_each_celltype) <- names(pca_out)

   # unique HVGs based on number of cell types

   HVGs <- unique(c(HVGs_each_celltype[[1]] , HVGs_each_celltype[[2]] ,
   HVGs_each_celltype[[3]] , HVGs_each_celltype[[4]]))

     Inline graphic CRITICAL: Edit the unique() function based on the
     numbers of cell types in your dataset. In this case, we investigated
     four major immune cell types: monocytes, NK cells, T cells, and B
     cells.

     Note: To find the optimal numbers of top genes that exhibit high
     variations in PCs, users can construct and investigate a histogram
     plot where the y-axis represents PC loading calculated from the
     prcomp() function in R, and the x-axis shows the numbers of genes
     ([106]Figure S1).

   Example code.

   # Extract PC loading values calculated from prcomp()

   PC1_mono <- sort(abs(pca_out[[1]]$rotation[,"PC1"]), decreasing=TRUE)

   PC1_NK <- sort(abs(pca_out[[2]]$rotation[,"PC1"]), decreasing=TRUE)

   PC1_Tcells <- sort(abs(pca_out[[3]]$rotation[,"PC1"]), decreasing=TRUE)

   PC1_Bcells <- sort(abs(pca_out[[4]]$rotation[,"PC1"]), decreasing=TRUE)

   # Plot

   plot(density(PC1_mono)$y, density(PC1_mono)$x,type="l" , col = "orange"
   , ylab = "PC1 loading values", xlab = "Number of genes" , main = " " ,
   ylim = c(0,0.02) , xlim = c(0, 2000))

   # Add lines

   lines(density(PC1_NK)$y, density(PC1_NK)$x,type="l" , col = "red" ,
   lwd=1)

   lines(density(PC1_Tcells)$y, density(PC1_Tcells)$x,type="l" , col =
   "blue" , lwd=1)

   lines(density(PC1_Bcells)$y, density(PC1_Bcells)$x,type="l" , col =
   "green" , lwd=1)

   # Add a legend

   legend(1550, 0.020, legend = c("Monocytes", "NK cells", "T cells" , "B
   cells"), fill=c( "orange","red","blue","green" ) , box.lty=0 )

   # Add vertical dashed blue line at x = 500

   abline(v=500, col="blue" , lty = "dashed")
     * 13.
       Perform a pathway enrichment analysis of the HVGs of all cell types
       using gProfiler2.[107]^13
          + a.
            All the genes in the genome are used as the background gene
            set.
          + b.
            Perform the pathway analysis using the gost() function.

   # Pathway enrichment analysis

   # Extract all genes that will be used as the background for pathway
   analysis

   bg <- rownames(Avg_expression_list[[1]])

   # Pathway enrichment analysis using gProfiler2

   GO_out <- gost(query = HVGs , organism = "hsapiens" ,
   correction_method = "fdr" , custom_bg = bg , significant = TRUE ,
   user_threshold = 0.05 , evcodes = TRUE , sources = "GO:BP")

     Note: Beside gProfiler2,[108]^13 alternatively, functional
     enrichment analysis can be performed using several computational/web
     tools such as clusterProfiler,[109]^23 Gene Ontology
     Consortium,[110]^24 Database for Annotation, Visualization and
     Integrated Discovery (DAVID),[111]^25 Kyoto Encyclopedia of Genes
     and Genomes (KEGG),[112]^26 and Reactome.[113]^27

     * 14.
       Save your outputs (optional).

   # Save outputs

   saveRDS(Avg_expression_list, file =
   "PATH/TO/YOUR/WORKING/DIRECTORY/Avg_expression_list.rds")

   saveRDS(GO_out , file = "PATH/TO/YOUR/WORKING/DIRECTORY/GO_out.rds")

   write.csv(HVGs , file = "PATH/TO/YOUR/WORKING/DIRECTORY/HVGs.csv",
   row.names = F)

   write.csv(bg , file = "PATH/TO/YOUR/WORKING/DIRECTORY/bg.csv",
   row.names = F)

Investigating dynamic expression patterns of HVGs across all time points and
cell types

     Inline graphic Timing: ∼5–10 min (for steps 15 to 23)

   In this section, we describe the normalization method used to obtain
   relative expression levels of HVGs associated with biological pathways
   of interest, across four different time points as well as cell types
   ([114]Figure 2B). The integrated Seurat object, average gene expression
   levels in each cell type, and the selected GO:BP (biological process
   gene ontology) terms from the pathway enrichment analysis are used as
   the inputs of this step.
     * 15.
       Import the integrated Seurat object, average gene expression levels
       in each cell type, and biological pathways of interest into R.

   # Load libraries

   library(Seurat)

   library(gprofiler2)

   library(ggplot2)

   library(tidyverse)

   # Set your working directory, pointing to the folder where all your
   input and output files will be saved

   setwd("PATH/TO/YOUR/WORKING/DIRECTORY")

   # Load objects

   sc_integrated <- readRDS(file = file =
   "PATH/TO/YOUR/WORKING/DIRECTORY/sc_integrated.rds")

   Avg_expression_list <- readRDS(file =
   "PATH/TO/YOUR/WORKING/DIRECTORY/Avg_expression_list.rds")

   selected_GO_out <- readRDS(file =
   "PATH/TO/YOUR/WORKING/DIRECTORY/selected_GO_out.rds")

   selected_GO_out <- selected_GO_out$result
     * 16.
       Calculate the mean value of each gene across all cells from the
       integrated Seurat object.

   # Calculate the mean value of each gene from all cells

   exp_matrix <- GetAssayData(sc_integrated , slot = "data" , assay =
   "RNA") %>% data.frame() %>% rownames_to_column()

   rownames(exp_matrix) <- exp_matrix$rowname

   exp_matrix$rowname <- NULL

   exp_matrix$Mean <- rowMeans(exp_matrix)

   exp_matrix$Gene <- rownames(exp_matrix)

   Mean_all_cells <- exp_matrix[c("Gene" , "Mean")]
     * 17.
       Create a list of HVGs in each “Gene Ontology: Biological processes
       (GO:BP)”.

   # Create a list of HVGs in each GO:BP

   genes_each_GO <- list()

   for (RN in 1:nrow(selected_GO_out)) {

    temp <-
   strsplit(unique(paste(as.character(selected_GO_out$intersection[RN]) ,
   sep = ",")) , ",")

    genes_each_GO[[RN]] <- temp[[1]]

   }

   names(genes_each_GO) <- selected_GO_out$term_name
     * 18.
       Extract the average gene expression values of HVGs in each selected
       GO:BP.

   # Extract average gene expression values of HVGs

   Avg_expression_each_GO <- list()

   temp_list <- list()

   for (RN in 1:length(Avg_expression_list)) {

    for (JA in 1:length(genes_each_GO)) {

     temp_list[[JA]] <-
   Avg_expression_list[[RN]][rownames(Avg_expression_list[[RN]]) %in%
   genes_each_GO[[JA]],]

     names(temp_list)[[JA]] <- names(genes_each_GO)[[JA]]

    }

   Avg_expression_each_GO[[RN]] <- temp_list

   }

   names(Avg_expression_each_GO) <- names(Avg_expression_list)
     * 19.
       Normalize each gene by adding a pseudocount of 1 (to avoid
       undefined results from a zero denominator – in unexpressed genes),
       and then divide by the mean value from all cells.

   # Normalise each gene by adding a pseudocount of 1, and divide by mean
   value from all cells.

   Avg_expression_each_GO_norm <- list()

   for (RN in 1:length(Avg_expression_each_GO)) {

    temp <- Avg_expression_each_GO[[RN]]

    for (JA in 1:length(temp)) {

     temp_1 <- Mean_all_cells[Mean_all_cells$Gene %in%
   rownames(temp[[JA]]),]

     temp_1

     temp[[JA]] <- (temp[[JA]] + 1) / (temp_1$Mean + 1)

    }

    Avg_expression_each_GO_norm[[RN]] <- temp

   }

   names(Avg_expression_each_GO_norm) <- names(Avg_expression_list)
     * 20.
       Calculate the sum of the normalized average gene expression of all
       HVGs belonging to each GO:BP of interest.

   # Sum of each gene in each GO:BP

   Sum_avg_expression_each_GO <- list()

   for (RN in 1:length(Avg_expression_each_GO_norm)) {

    temp <- Avg_expression_each_GO_norm[[RN]]

    for (JA in 1:length(temp)) {

     temp[[JA]][names(Avg_expression_each_GO_norm[RN]),] <-
   colSums(temp[[JA]])

    }

    Sum_avg_expression_each_GO[[RN]] <- temp

   }

   names(Sum_avg_expression_each_GO) <- names(Avg_expression_each_GO_norm)
     * 21.
       Create a big dataframe, where each row represents each GO:BP and
       each column represents each cell type in each condition.

   # Create list of summation each GO:BP in each cell type

   Sum_exp_each_GO_celltype <- list()

   for (RN in 1:length(genes_each_GO)) {

    Sum_exp_each_GO_celltype[[RN]] <-
   rbind(tail(Sum_avg_expression_each_GO[[1]][[RN]] , 1) ,

     tail(Sum_avg_expression_each_GO[[2]][[RN]] , 1) ,

     tail(Sum_avg_expression_each_GO[[3]][[RN]] , 1) ,

     tail(Sum_avg_expression_each_GO[[4]][[RN]] , 1))

   }

   names(Sum_exp_each_GO_celltype) <- names(genes_each_GO)

   # Create a big dataframe, each row represents each GO:BP, and each
   column represents each cell type in each condition.

   # matrix_size = 1 : (no_of_cell_types x biological_conditions)

   no_cell_types <- 4

   no_samples <- 10

   matrix_size <- no_cell_types ∗ no_samples

   Input_dot_plot <- as.data.frame(matrix(1:matrix_size, nrow = 1, ncol =
   matrix_size))

   colnames(Input_dot_plot) <- c(paste("Monocytes" ,
   names(Sum_exp_each_GO_celltype[[1]])) , paste("NK" ,
   names(Sum_exp_each_GO_celltype[[1]])) , paste("T" ,
   names(Sum_exp_each_GO_celltype[[1]])) , paste("B" ,
   names(Sum_exp_each_GO_celltype[[1]])))

   for (RN in 1:length(genes_each_GO)) {

    temp <- Sum_exp_each_GO_celltype[[RN]]

    temp <- cbind(temp[1,],temp[2,] , temp[3,] , temp[4,])

    colnames(temp) <- c(paste("Monocytes" ,
   names(Sum_exp_each_GO_celltype[[1]])) , paste("NK" ,
   names(Sum_exp_each_GO_celltype[[1]])) , paste("T" ,
   names(Sum_exp_each_GO_celltype[[1]])) , paste("B" ,
   names(Sum_exp_each_GO_celltype[[1]])))

    Input_dot_plot <- rbind(Input_dot_plot , temp)

   }

   Input_dot_plot <- Input_dot_plot[-1,]

   rownames(Input_dot_plot) <- names(genes_each_GO)
     * 22.
       Calculate the z-score across all samples and cell types for
       visualization ([115]Figure 2B).

   # For visualization, calculate z-score by row

   Input_dot_plot_zscore <-
   t(apply((Input_dot_plot[,1:length(Input_dot_plot)]), 1, function(x){

    mean <- mean(x)

    SD <- sd(x)

    Z_score <- (x-mean)/SD

    Z_score

   }))

   Input_dot_plot_zscore <- as.data.frame(Input_dot_plot_zscore)

   # Convert to long format for plot

   forplot <- gather(Input_dot_plot_zscore %>% rownames_to_column("GO"),
   key = "ST" , value = "Expression" , -GO)

   forplot <- forplot %>% separate(col = "ST" , into = c("Cell_Types" ,
   "Patients" , "time"),sep = " " , remove = F)

   forplot$time <- factor(forplot$time , levels = c("Day-2" , "Day-1" ,
   "Def" , "Wk2" , "I" , "II"))

   forplot$Cell_Types <- factor(forplot$Cell_Types , levels =
   c("Monocytes" , "NK" , "T" , "B"))

   forplot$Patients <- factor(forplot$Patients , levels = c("DF" , "DHF" ,
   "Healthy"))
     * 23.
       Visualize the results using the ggplot2 package.[116]^14

   # Visualize the relative expression of HVGs over all samples and cell
   types

   ggplot(forplot, aes(x=time, y=GO , color= Expression , size =
   Expression)) + geom_point(alpha = 1.5) + theme_classic() +
   scale_colour_gradient2( low = "blue", mid = "white", high = "red",
   space = "Lab" , limits =
   c(-max(forplot$Expression),max(forplot$Expression)) ) + xlab("") +
   scale_size_continuous(range = c(5,5)) + facet_grid(∼Cell_Types+Patients
   , scales = "free", space = "free") + labs(color = paste("z-score")) +
   theme(axis.text.x = element_text(angle = 60, hjust=1),
   axis.text=element_text(size=20) , axis.title.y = element_text(size=25),
   strip.background = element_rect(colour = "white", fill =
   c("gray","darkorange3", "gray")), legend.text = element_text(size =
   15), legend.title = element_text(size = 20),legend.key.size = unit(1,
   "cm")) + ylab("Significant GO terms")

HVG identification and pathway enrichment analysis across early and late time
points in immune cell types - COVID-19 case study

     Inline graphic Timing: ∼ 10–15 min

   In addition to the dengue data published in our earlier study,[117]^1
   here we also demonstrate the application of this framework on another
   time-course scRNA-seq study. We retrieved processed scRNA-seq data from
   a COVID-19 study,[118]^10 which was deposited on FASTGenomics
   ([119]https://www.fastgenomics.org/). We selected two COVID-19 patients
   (“cohort 1”) with the donor IDs "C19-CB-0009″ and "C19-CB-0012″, whose
   samples were obtained at the same “days after symptom onset”; together
   with two healthy donors, "P15F″ and "P17H". We used the same cell type
   annotation as initially characterized by the authors of the
   study.[120]^10

     Note: Using our framework to analyze this COVID-19 dataset, we
     observed distinct overall transcriptome profiles between early and
     late phases of the infection in each of the cell types of interest,
     while the differences between the two patients were relatively small
     ([121]Figure 3A). Moreover, the overall expression patterns of the
     two healthy donors were nicely grouped together and clearly
     separated from the COVID-19 patients at all the time points
     ([122]Figure 3A). We next identified HVGs and virtualized their
     dynamic expression patterns, for different groups of biological
     pathways that the HVGs are associated with ([123]Figure 3B).
     Interestingly, we observed clear up-regulation of the HVGs in the
     early infection associated with “response to type I interferon”,
     “interferon-mediated signaling pathway”, and “type I
     interferon-mediated signaling pathway” in monocytes ([124]Figure 3B,
     labeled in blue), which have also been described in the original
     paper,[125]^10 but using a different analytical framework.[126]^10
     In addition to this, our analysis and visualization also revealed
     cell-type-specific expression dynamics of the HVGs associated with
     certain biological pathways, such as “natural killer cell mediated
     cytotoxicity” HVGs being specifically upregulated in NK cells; “T
     cell receptor V(D)J recombination” HVGs being expanded in CD4+ and
     CD8+ T cells, and several B cells-related pathways such as “B cell
     activation”, “humoral immune response”, and “B cell mediated
     immunity“ being up-regulated specifically in B cells at the early
     infection ([127]Figure 3B), all of which have not been mentioned in
     the original study.[128]^10

Figure 3.

   [129]Figure 3
   [130]Open in a new tab

   The relative expression values of HVGs in each biological process (BP)
   across two COVID-19 patients during the early time point (d9; 9 days
   after symptom onset) and late time point (d16; 16 days after symptom
   onset), together with two healthy controls

   (A) PCA of average gene expression in each cell type. Different colors
   represent the days after symptom onset, while different shapes
   represent individual samples.

   (B) Dotplot showing the relative expression levels of HVGs related to
   each BP. The time-specific BPs described in the original paper[131]^10
   are labeled in blue. The NK-specific BP is highlighted in purple,
   T-specific BP is in dark red, and B-specific BPs are in green. P1 =
   covid-19 patient 1; P2 = covid-19 patient 2; HC = healthy controls.
   d9 = 9 days after symptom onset; d16 = 16 days after symptom onset.

Expected outcomes

   We propose an approach for identification of highly variable genes
   (HVGs) and investigation of their expression patterns across multiple
   time points (or biological conditions). Similarly to differentially
   expressed genes (DEGs) that are commonly used to represent the genes
   that might be activated or repressed differentially between two
   biological conditions, such as between patients vs.’ healthy controls,
   or treatment vs.’ normal controls, HVGs depict the genes that show the
   highest dynamics across multiple time points, and their functions might
   be linked to temporally specific biological processes over the course
   of the studies.

   The overall variations between multiple transcriptomes obtained in
   different biological conditions can be consolidated using Principal
   Component Analysis (PCA) and then visualized in a two-dimensional plot.
   With the scRNA-seq data, we can readily obtain and investigate the PCA
   plots of different cell types and subpopulations ([132]Figures 2A and
   [133]3A). HVGs of different cell types are also identified during this
   PCA process, and used as the inputs of pathway enrichment analysis.
   Relative expression patterns across time points of HVGs associated with
   biological processes of interest are then computed and visualized
   across different cell types in a single figure, as shown in
   [134]Figures 2B and [135]3B. As demonstrated using the dengue[136]^1
   and COVID-19[137]^10 case studies, we can identify the biological
   processes whose HVGs demonstrate cell-type-specific expression dynamics
   across the time points, and hence suggest the links between time-point
   specific gene expression regulation and relevant biological processes
   in each of the cell types of interest. In addition to scRNA-seq, we
   expect that the framework will be adaptable for the investigation of
   temporal or longitudinal gene expression datasets obtained using other
   high-throughput and omic technologies, such as time-course bulk RNA-seq
   or proteomic studies.

Limitations

   Batch effect can occur when the samples obtained using different
   technologies or platforms. In the protocol described here, all the
   samples were processed using the same versions of the 10x genomics
   single-cell kits. Hence, we did not observe technical batch effects. If
   multiple samples used in the analysis are generated by different
   techniques, it is advisable to inspect the batch effect by PCA before
   starting the HVG analysis. Library preparation techniques could appear
   as the factor most affecting the grouping of the samples. In such
   cases, it may not be possible to obtain HVGs that truly represent the
   most variances from biological conditions using our described
   protocols. It might be possible; however, to regress such batch effects
   using tools such as Harmony,[138]^28 Mutual Nearest Neighbors
   (MNN),[139]^29 and Seurat Canonical Correlation Analysis (CCA).[140]^30

Troubleshooting

Problem 1

   Can this protocol be used with a .h5ad input file for the HVG
   identification (related to step 9)?

Potential solution

   A .h5ad file can be converted into a Seurat object, and the average
   gene expression can be calculated using the codes below.

   Convert a .h5ad file to a Seurat object.

   library(zellkonverter)

   sce <- readH5AD("PATH/TO/file.h5ad", verbose = TRUE)

   seurat_obj <- as.Seurat(sce, counts = "X", data = NULL)

   Calculate average gene expression levels in each cell type:

   Avg_expression_list <- list()

   for (RN in 1:length(each_celltype_list)) {

    Idents(each_celltype_list[[RN]]) <- "ST"

    Avg_expression_list[[RN]] <-
   AverageExpression(each_celltype_list[[RN]] , assays = "originalexp" ,
   slot = "data")

    Avg_expression_list[[RN]] <-
   as.data.frame(Avg_expression_list[[RN]]$originalexp)

   }

   names(Avg_expression_list) <- names(each_celltype_list)

Problem 2

   The code described in step 17 is specific for the output generated from
   the gProfiler2 package.[141]^13 If different functional enrichment
   tools are used, the outputs might be different from the gProfiler2 one,
   which could result in an error when running step 17.

Potential solution

   Users can alternatively create a list of GO terms themselves, as shown
   below. Each GO term in a list should contain genes that are related to
   that particular GO term. After creating this list, users should be able
   to continue with step 18.

   An example of a list,

   [142]graphic file with name fx4.jpg
   [143]Open in a new tab

Problem 3

   The “ambiguous” gene names are sometimes not recognized by the
   gProfiler2 package[144]^13 (related to step 13).

Potential solution

   We encourage users to use the Ensembl IDs as default inputs.

Problem 4

   Large R, Rmd scripts, or R objects sometimes cannot be loaded and/or
   run on Rstudio (related to HVG identification and pathway enrichment
   analysis and investigating dynamic expression patterns of HVGs across
   all time points and cell types sections).

Potential solution

   We encourage users to run the scripts using .R via R or RServer with
   less Graphic User Interface (GUI) instead.

Resource availability

Lead contact

   Further information and requests for resources and reagents should be
   directed to and will be fulfilled by the Lead contact, Varodom
   Charoensawan (Email:
   [145]varodom.cha@mahidol.ac.th).[146]jantarikaarora@gmail.com,
   [147]anunya.opa@mahidol.edu

Materials availability

   This study did not generate new unique materials.

Acknowledgments