Graphical abstract graphic file with name fx1.jpg [34]Open in a new tab Highlights * • Highly variable genes (HVGs) as an alternative to DEGs for time-series scRNA-seq data * • Visualizing dynamic expression patterns of HVGs over multiple cell types in one figure * • Exploring biological pathways with common and cell-type-specific expression dynamics __________________________________________________________________ Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics. __________________________________________________________________ Here, we present a computational approach for investigating highly variable genes (HVGs) associated with biological pathways of interest, across multiple time points and cell types in single-cell RNA-sequencing (scRNA-seq) data. Using public dengue virus and COVID-19 datasets, we describe steps for using the framework to characterize the dynamic expression levels of HVGs related to common and cell-type-specific biological pathways over multiple immune cell types. Before you begin Overview Investigation of differentially expressed genes (DEGs) between two biological conditions or paired samples (e.g., control vs.’ treatment, or healthy vs.’ disease samples) is one of the key steps in genome-wide gene expression analyses, such as transcriptomic and proteomic studies. With single-cell RNA-sequencing (scRNA-seq) technologies, we are now able to look into transcriptomic profiles of individual cells, and hence perform DEG analyses on different cell types and suppopulations. Alternatively, one can also characterize a set of genes that are differentially expressed between cell types, frequently referred to as marker genes. However, most current computational methods were intended for the comparison between two conditions of interest and/or on specific cell type at the time,[35]^2^,[36]^3^,[37]^4^,[38]^5^,[39]^6^,[40]^7^,[41]^8^,[42]^9 but not for investigating gene expression dynamics across more than two conditions, such as in a time-course dataset. Here, we present a framework for identifying and investigating highly variable genes (HVGs) that demonstrate highly dynamic expression patterns across several time points (or biological conditions). Using our workflow, one can visualize relative expression levels of HVGs associated with biological pathways of interest in multiple time points and across different cell types in one analysis. Using publicly available time-course scRNA-seq datasets of dengue[43]^1 and SARS-CoV-2[44]^10 infections, we have demonstrated how the protocol can be used to extract HVGs, their biological processes, and gene expression patterns that are unique to certain cell types, or shared across multiple populations of cells. Install tools/packages * 1. Installation of R ([45]https://www.r-project.org/), and optionally, RStudio ([46]https://www.rstudio.com/). * 2. Installation of the relevant R packages. + a. All the R packages required from the analyses are listed under the Software and Algorithms section of the [47]key resources table. + b. Users can also download the Docker Hub, URL: [48]https://hub.docker.com/r/jantarika/rstudio_denguetimecours e, that contains all the R packages used in this protocol. Download scRNA-seq datasets * 3. Download the datasets, which was used as an example in this protocol. + a. Download the datasets mentioned in the Deposited Data section of the [49]key resources table. + b. Alternatively, the complete datasets used for this analysis are also deposited to Mendeley Data: [50]https://data.mendeley.com/datasets/6ry3x7r8hf/3. Institutional permission The data described in this article were originally published by Arora and Opasawatchai and colleagues,[51]^1 which was approved by the Institutional Review Boards of Faculty of Medicine Vajira Hospital (No.015/12), Faculty of Tropical Medicine Mahidol University (TMEC 13041) and Faculty of Medicine, Ramathibodi Hospital, Mahidol University (MURA2019/603). Key resources table REAGENT or RESOURCE SOURCE IDENTIFIER Deposited data __________________________________________________________________ Dataset: Raw sequencing data of 4 time-points from DF and DHF patients and a healthy donor Arora et al.[52]^1 ArrayExpress: E-MTAB-9467 Dataset: 4k PBMCs from a healthy donor 10x Genomics Single Cell Gene Expression Datasets [53]https://support.10xgenomics.com/single-cell-gene-expression/dataset s/2.1.0/pbmc4k Complete datasets used for this protocol Mendeley Data [54]https://data.mendeley.com/datasets/6ry3x7r8hf/3 Dataset: Processed scRNA-seq datasets of COVID-19 PBMC samples Schulte-Schrepping et al.[55]^10 [56]http://fastgenomics.org Algorithms and computer codes, and the version record Arora et al.[57]^1 [58]https://github.com/vclabsysbio/scRNAseq_DVtimecourse [59]https://doi.org/10.5281/zenodo.7968936 __________________________________________________________________ Software and algorithms __________________________________________________________________ RStudio 4.0.2 RStudio [60]https://www.rstudio.com/ Seurat v3.1.2 Stuart et al.[61]^2 [62]https://satijalab.org/seurat/ SoupX v1.0.1 Young and Behjati.[63]^11 [64]https://github.com/constantAmateur/SoupX DoubletFinder v2.0.3 McGinnis et al.[65]^12 [66]https://github.com/chris-mcginnis-ucsf/DoubletFinder gProfier2 v0.1.8 Raudvere et al.[67]^13 [68]https://biit.cs.ut.ee/gprofiler/ ggplot2 v3.3.2 Wickham.[69]^14 [70]https://ggplot2.tidyverse.org/ tidyverse v.2.0.0 Wickham et al.[71]^15 [72]https://www.tidyverse.org/packages/ [73]Open in a new tab Materials and equipment Bioinformatic analyses All the bioinformatic analyses presented here were tested on an in-house server (Intel Core Processor (Broadwell, IBRS), 39 CPUs, and 480 GB RAM.), running on Ubuntu 16.04.6 LTS. Step-by-step method details Before identification and investigation of time-course HVGs and their dynamic expression profiles, the scRNA-seq data must be pre-processed and quality controlled, which are summarized briefly here, and are described in details through GitHub ([74]https://github.com/vclabsysbio/scRNAseq_DVtimecourse). Note that these steps have also been comprehensively described elsewhere.[75]^16^,[76]^17^,[77]^18 QC and filtering single-cell RNA-seq data Inline graphic Timing: ∼ 1–2 h/sample (for steps 1 to 3) The quality control and filtering scRNA-seq data were performed separately on individual samples, before data integration and cell type annotation ([78]Figure 1). The QC stage comprises correcting of ambient RNAs, exclusion of dead and low quality cells and potential doublets, in prior to the downstream bioinformatic analyses. Raw and filtered expression matrices generated by the cellranger count function are required as the inputs of this step - see [79]key resources table. Note:[80]Figure 1 illustrates the overview of the bioinformatic pipeline described in this protocol. Note: R packages and codes required for the QC and filtering step are provided in Github URL: [81]https://github.com/vclabsysbio/scRNAseq_DVtimecourse. Complete datasets used for this step are deposited in Mendeley Data: [82]https://data.mendeley.com/datasets/6ry3x7r8hf/3. * 1. Correct ambient RNAs using SoupX.[83]^11 Note: Check for the expression levels of ambient RNAs in your datasets. If an excessive amount is observed, users can apply several ambient RNA removal tools such as SoupX[84]^11 to correct their expression levels. For more details of predicting and correcting the expression values of ambient RNAs, please refer to the SoupX tutorial.[85]^11 * 2. Exclude the cells expressing excessive mitochondrial genes, in this example, more than 10% in the total transcript counts. Note: Check the distribution of percent mitochondrial genes (percent.mt) to determine suitable cut-offs in your scRNA-seq datasets. For details, please refer to the Seurat toolkit.[86]^2 * 3. Predict and discard doublets using doubletFinder.[87]^12 Note: For droplet-based scRNA-seq, we recommend excluding “doublets” and “multiplets”. For more details about the optimal cut-offs in each parameter, please refer to doubletFinder.[88]^12 Figure 1. [89]Figure 1 [90]Open in a new tab Overview of bioinformatic pipeline of scRNA-seq data analysis described in this protocol HVGs, highly variable genes. Data normalization, integration, and cell type annotation Inline graphic Timing: ∼ 2 h (for steps 4 to 7) After filtering out low quality cells and correcting the expression values of ambient RNAs, the data are used as inputs for normalization, integration, and cell type annotation. Seurat objects of individual samples after QC and filtering step are used as the inputs of this step. * 4. Run the standard preprocessing workflow for each individual sample. + a. Import and create a list of individual Seurat objects after the QC and filtering step. + b. Run the SCTransform function, a regularized negative binomial regression.[91]^19 + c. Apply the first 30 principal components (PCs) for cell clustering. + d. Identify cell clusters by the shared nearest neighbor (SNN) method using the Louvain algorithm with multilevel refinement. * 5. Integrate all the samples. We used the following settings in our examples here: three thousand (3,000) features, a Louvain algorithm with multi-level refinement for clustering and the resolution of three. Other parameters are as default. * 6. Normalize the transcript counts of each cell using the NormalizeData() function. * 7. Identify major immune cell types using known canonical marker genes (as described in [92]Table S1). Note: For cell type annotation, in our examples we first labeled different clusters of T cells with the same name in order to group them together as a single “T cells” cluster, which was subsequently re-clustered into subpopulations. Note: High heterogeneity of PBMCs, especially at the subpopulation levels, might be difficult to identify using the known marker genes ([93]Table S1). Alternatively, reference-based for cell type annotation such as SingleR,[94]^20 Azimuth,[95]^21 and ScType[96]^22 can be applied. Highly variable gene (HVG) identification and pathway enrichment analysis across multiple time points – Dengue case study Inline graphic Timing: ∼10–15 min (for steps 8 to 14) Here, we describe the process for obtaining HVGs, which represent the variations in transcriptional levels across several biological conditions (typically >2 conditions, in our case, four time points from two DENV-infected patients and two healthy controls) in multiple immune cell types of interest. The inputs of this step are the integrated scRNA-seq Seurat objects from the previous pro-processing and cell type annotation described above ([97]Figure 1). Note: We first performed a pseudo-bulk expression analysis by calculating the average transcription levels of all the genes. Then, principal component analysis (PCA) of the averaged gene expression in each cell type was performed ([98]Figure 2A). This PCA of each cell type was used to extract the genes that demonstrate the largest variation across biological conditions, – to be referred to as “Highly Variable Genes (HVGs)”. In this example, the most apparent differences are between the four time points, whereas those between the two DENV patients are relatively small. Finally, we investigated the biological pathways associated with the HVGs that show common and unique expression dynamics among different cell types and biological conditions ([99]Figure 2B). The workflow is described step-by-step below. Note: Processed scRNA-seq dataset used in this step is deposited in Mendeley Data: [100]https://data.mendeley.com/datasets/6ry3x7r8hf/3 Figure 2. [101]Figure 2 [102]Open in a new tab The relative expression levels of highly variable genes (HVGs) in each biological process (BP) of interest, across the four time points and major immune cell types, from two dengue-infected patients and two healthy controls (A and B) Figures were modified from Arora et al., 2022[103]^1 (A). PCA of average gene expression in each major immune cell type of interest. (B). Dotplot representing relative expression levels of HVGs. The BPs of the HVGs across the four major immune cell types of interest here are highlighted in red texts, monocyte-specific BPs are in blue, and B cell-specific BPs are in green. * 8. Subset the integrated Seurat object according to their major cell types into monocytes, NK cells, T cells, and B cells/plasma cells/plasmablasts - the main immune cell types of interest in this study. # Load libraries library(Seurat) library(gprofiler2) library(ggplot2) # Set your working directory, pointing to the folder where all your input and output files will be saved setwd("PATH/TO/YOUR/WORKING/DIRECTORY") # Load integrated data sc_integrated <- readRDS(file = "PATH/TO/YOUR/WORKING/DIRECTORY/sc_integrated.rds") # Subset each cell type Idents(sc_integrated) <- "Cell_Types" each_celltype_list <- list() each_celltype <- c("Monocytes" , "NK cells" , "T cells" , "B cells") for (RN in each_celltype) { each_celltype_list[[RN]] <- subset(sc_integrated , idents = RN) } names(each_celltype_list) <- each_celltype * 9. Calculate the average gene expression levels in each cell type across biological conditions of interest. + a. Set the object’s identity class based on condition of interest using the Idents() function. + b. Calculate the average gene expression using the AverageExpression() function. # Calculate average gene expression in each cell type across severity and time Avg_expression_list <- list() for (RN in 1:length(each_celltype_list)) { Idents(each_celltype_list[[RN]]) <- "ST" Avg_expression_list[[RN]] <- AverageExpression(each_celltype_list[[RN]] , assays = "RNA" , slot = "data") Avg_expression_list[[RN]] <- as.data.frame(Avg_expression_list[[RN]]$RNA) } names(Avg_expression_list) <- names(each_celltype_list) * 10. Construct the principal components (PCs) of the average gene expression levels in each cell type using the prcomp() function in R ([104]Figure 2A). # Construct Principal Components (PCs) in each cell type pca_out <- list() pca_perc <- list() df_pca <- list() for (RN in 1:length(Avg_expression_list)) { pca_out[[RN]] <- prcomp(t(Avg_expression_list[[RN]])) pca_perc[[RN]] <- round(100∗pca_out[[RN]]$sdevˆ2/sum(pca_out[[RN]]$sdevˆ2),1) df_pca[[RN]] <- data.frame(PC1 = pca_out[[RN]]$x[,1], PC2 = pca_out[[RN]]$x[,2], sample = colnames(Avg_expression_list[[RN]])) # Add metadata can be differences in each dataset df_pca[[RN]]$Severity <- c(rep("DF" , 4) , rep("DHF" , 4) , "Healthy" , "Healthy") df_pca[[RN]]$Time <- c("Day-2" , "Day-1" , "Def" , "Wk2" , "Day-2" , "Day-1" , "Def" , "Wk2" ,"Healthy I" , "Healthy II") df_pca[[RN]]$Time <- factor(df_pca[[RN]]$Time , levels = c( "Day-2" , "Day-1" , "Def" , "Wk2" , "Healthy I" , "Healthy II")) } names(pca_out) <- names(Avg_expression_list) names(pca_perc) <- names(Avg_expression_list) names(df_pca) <- names(Avg_expression_list) Note: Please add metadata based on your datasets. * 11. Visualize the PCA results using the ggplot2 package.[105]^14 # Visualize PCA results for (RN in 1:length(df_pca)) { pca_plot<- ggplot(df_pca[[RN]], aes(PC1,PC2, color = Time))+ geom_point(aes(shape = Severity ), size=6 , stroke = 1.4)+ labs(x=paste0("PC1 (",pca_perc[[RN]][1],"%)"), y=paste0("PC2 (",pca_perc[[RN]][2],"%)")) + scale_color_manual(values=c("darkgoldenrod2", "#ff7400","#ff1218", "#47849c" , "darkgrey" , "gray6")) + theme(axis.text = element_text(size = 17 , face="bold" , colour = "black") , axis.title.y = element_text(color="black", size=15, face="bold") , axis.title.x = element_text(color="black", size=17, face="bold") , legend.title = element_text(face = "bold" , size = 17) , legend.text = element_text(size = 16) , legend.key.size = unit(1, "cm") , legend.key.width = unit(0.5,"cm") , legend.key = element_rect(fill = "white") ) + ggtitle(names(df_pca[RN])) plot(pca_plot) } Note: Color and shape can be manually adjusted based on your data. * 12. Union the top 500 genes from the first and second PCs of each immune cell type – referred to as “HVGs” herein. # Union top 500 genes from PC1 and PC2 from all cell types HVGs_each_celltype <- list() for (RN in 1:length(pca_out)) { HVGs_each_celltype[[RN]] <- union(rownames(data.frame(sort(abs(pca_out[[RN]]$rotation[,"PC1"]), decreasing=TRUE)[1:500])) , rownames(data.frame(sort(abs(pca_out[[RN]]$rotation[,"PC2"]), decreasing=TRUE)[1:500]))) } names(HVGs_each_celltype) <- names(pca_out) # unique HVGs based on number of cell types HVGs <- unique(c(HVGs_each_celltype[[1]] , HVGs_each_celltype[[2]] , HVGs_each_celltype[[3]] , HVGs_each_celltype[[4]])) Inline graphic CRITICAL: Edit the unique() function based on the numbers of cell types in your dataset. In this case, we investigated four major immune cell types: monocytes, NK cells, T cells, and B cells. Note: To find the optimal numbers of top genes that exhibit high variations in PCs, users can construct and investigate a histogram plot where the y-axis represents PC loading calculated from the prcomp() function in R, and the x-axis shows the numbers of genes ([106]Figure S1). Example code. # Extract PC loading values calculated from prcomp() PC1_mono <- sort(abs(pca_out[[1]]$rotation[,"PC1"]), decreasing=TRUE) PC1_NK <- sort(abs(pca_out[[2]]$rotation[,"PC1"]), decreasing=TRUE) PC1_Tcells <- sort(abs(pca_out[[3]]$rotation[,"PC1"]), decreasing=TRUE) PC1_Bcells <- sort(abs(pca_out[[4]]$rotation[,"PC1"]), decreasing=TRUE) # Plot plot(density(PC1_mono)$y, density(PC1_mono)$x,type="l" , col = "orange" , ylab = "PC1 loading values", xlab = "Number of genes" , main = " " , ylim = c(0,0.02) , xlim = c(0, 2000)) # Add lines lines(density(PC1_NK)$y, density(PC1_NK)$x,type="l" , col = "red" , lwd=1) lines(density(PC1_Tcells)$y, density(PC1_Tcells)$x,type="l" , col = "blue" , lwd=1) lines(density(PC1_Bcells)$y, density(PC1_Bcells)$x,type="l" , col = "green" , lwd=1) # Add a legend legend(1550, 0.020, legend = c("Monocytes", "NK cells", "T cells" , "B cells"), fill=c( "orange","red","blue","green" ) , box.lty=0 ) # Add vertical dashed blue line at x = 500 abline(v=500, col="blue" , lty = "dashed") * 13. Perform a pathway enrichment analysis of the HVGs of all cell types using gProfiler2.[107]^13 + a. All the genes in the genome are used as the background gene set. + b. Perform the pathway analysis using the gost() function. # Pathway enrichment analysis # Extract all genes that will be used as the background for pathway analysis bg <- rownames(Avg_expression_list[[1]]) # Pathway enrichment analysis using gProfiler2 GO_out <- gost(query = HVGs , organism = "hsapiens" , correction_method = "fdr" , custom_bg = bg , significant = TRUE , user_threshold = 0.05 , evcodes = TRUE , sources = "GO:BP") Note: Beside gProfiler2,[108]^13 alternatively, functional enrichment analysis can be performed using several computational/web tools such as clusterProfiler,[109]^23 Gene Ontology Consortium,[110]^24 Database for Annotation, Visualization and Integrated Discovery (DAVID),[111]^25 Kyoto Encyclopedia of Genes and Genomes (KEGG),[112]^26 and Reactome.[113]^27 * 14. Save your outputs (optional). # Save outputs saveRDS(Avg_expression_list, file = "PATH/TO/YOUR/WORKING/DIRECTORY/Avg_expression_list.rds") saveRDS(GO_out , file = "PATH/TO/YOUR/WORKING/DIRECTORY/GO_out.rds") write.csv(HVGs , file = "PATH/TO/YOUR/WORKING/DIRECTORY/HVGs.csv", row.names = F) write.csv(bg , file = "PATH/TO/YOUR/WORKING/DIRECTORY/bg.csv", row.names = F) Investigating dynamic expression patterns of HVGs across all time points and cell types Inline graphic Timing: ∼5–10 min (for steps 15 to 23) In this section, we describe the normalization method used to obtain relative expression levels of HVGs associated with biological pathways of interest, across four different time points as well as cell types ([114]Figure 2B). The integrated Seurat object, average gene expression levels in each cell type, and the selected GO:BP (biological process gene ontology) terms from the pathway enrichment analysis are used as the inputs of this step. * 15. Import the integrated Seurat object, average gene expression levels in each cell type, and biological pathways of interest into R. # Load libraries library(Seurat) library(gprofiler2) library(ggplot2) library(tidyverse) # Set your working directory, pointing to the folder where all your input and output files will be saved setwd("PATH/TO/YOUR/WORKING/DIRECTORY") # Load objects sc_integrated <- readRDS(file = file = "PATH/TO/YOUR/WORKING/DIRECTORY/sc_integrated.rds") Avg_expression_list <- readRDS(file = "PATH/TO/YOUR/WORKING/DIRECTORY/Avg_expression_list.rds") selected_GO_out <- readRDS(file = "PATH/TO/YOUR/WORKING/DIRECTORY/selected_GO_out.rds") selected_GO_out <- selected_GO_out$result * 16. Calculate the mean value of each gene across all cells from the integrated Seurat object. # Calculate the mean value of each gene from all cells exp_matrix <- GetAssayData(sc_integrated , slot = "data" , assay = "RNA") %>% data.frame() %>% rownames_to_column() rownames(exp_matrix) <- exp_matrix$rowname exp_matrix$rowname <- NULL exp_matrix$Mean <- rowMeans(exp_matrix) exp_matrix$Gene <- rownames(exp_matrix) Mean_all_cells <- exp_matrix[c("Gene" , "Mean")] * 17. Create a list of HVGs in each “Gene Ontology: Biological processes (GO:BP)”. # Create a list of HVGs in each GO:BP genes_each_GO <- list() for (RN in 1:nrow(selected_GO_out)) { temp <- strsplit(unique(paste(as.character(selected_GO_out$intersection[RN]) , sep = ",")) , ",") genes_each_GO[[RN]] <- temp[[1]] } names(genes_each_GO) <- selected_GO_out$term_name * 18. Extract the average gene expression values of HVGs in each selected GO:BP. # Extract average gene expression values of HVGs Avg_expression_each_GO <- list() temp_list <- list() for (RN in 1:length(Avg_expression_list)) { for (JA in 1:length(genes_each_GO)) { temp_list[[JA]] <- Avg_expression_list[[RN]][rownames(Avg_expression_list[[RN]]) %in% genes_each_GO[[JA]],] names(temp_list)[[JA]] <- names(genes_each_GO)[[JA]] } Avg_expression_each_GO[[RN]] <- temp_list } names(Avg_expression_each_GO) <- names(Avg_expression_list) * 19. Normalize each gene by adding a pseudocount of 1 (to avoid undefined results from a zero denominator – in unexpressed genes), and then divide by the mean value from all cells. # Normalise each gene by adding a pseudocount of 1, and divide by mean value from all cells. Avg_expression_each_GO_norm <- list() for (RN in 1:length(Avg_expression_each_GO)) { temp <- Avg_expression_each_GO[[RN]] for (JA in 1:length(temp)) { temp_1 <- Mean_all_cells[Mean_all_cells$Gene %in% rownames(temp[[JA]]),] temp_1 temp[[JA]] <- (temp[[JA]] + 1) / (temp_1$Mean + 1) } Avg_expression_each_GO_norm[[RN]] <- temp } names(Avg_expression_each_GO_norm) <- names(Avg_expression_list) * 20. Calculate the sum of the normalized average gene expression of all HVGs belonging to each GO:BP of interest. # Sum of each gene in each GO:BP Sum_avg_expression_each_GO <- list() for (RN in 1:length(Avg_expression_each_GO_norm)) { temp <- Avg_expression_each_GO_norm[[RN]] for (JA in 1:length(temp)) { temp[[JA]][names(Avg_expression_each_GO_norm[RN]),] <- colSums(temp[[JA]]) } Sum_avg_expression_each_GO[[RN]] <- temp } names(Sum_avg_expression_each_GO) <- names(Avg_expression_each_GO_norm) * 21. Create a big dataframe, where each row represents each GO:BP and each column represents each cell type in each condition. # Create list of summation each GO:BP in each cell type Sum_exp_each_GO_celltype <- list() for (RN in 1:length(genes_each_GO)) { Sum_exp_each_GO_celltype[[RN]] <- rbind(tail(Sum_avg_expression_each_GO[[1]][[RN]] , 1) , tail(Sum_avg_expression_each_GO[[2]][[RN]] , 1) , tail(Sum_avg_expression_each_GO[[3]][[RN]] , 1) , tail(Sum_avg_expression_each_GO[[4]][[RN]] , 1)) } names(Sum_exp_each_GO_celltype) <- names(genes_each_GO) # Create a big dataframe, each row represents each GO:BP, and each column represents each cell type in each condition. # matrix_size = 1 : (no_of_cell_types x biological_conditions) no_cell_types <- 4 no_samples <- 10 matrix_size <- no_cell_types ∗ no_samples Input_dot_plot <- as.data.frame(matrix(1:matrix_size, nrow = 1, ncol = matrix_size)) colnames(Input_dot_plot) <- c(paste("Monocytes" , names(Sum_exp_each_GO_celltype[[1]])) , paste("NK" , names(Sum_exp_each_GO_celltype[[1]])) , paste("T" , names(Sum_exp_each_GO_celltype[[1]])) , paste("B" , names(Sum_exp_each_GO_celltype[[1]]))) for (RN in 1:length(genes_each_GO)) { temp <- Sum_exp_each_GO_celltype[[RN]] temp <- cbind(temp[1,],temp[2,] , temp[3,] , temp[4,]) colnames(temp) <- c(paste("Monocytes" , names(Sum_exp_each_GO_celltype[[1]])) , paste("NK" , names(Sum_exp_each_GO_celltype[[1]])) , paste("T" , names(Sum_exp_each_GO_celltype[[1]])) , paste("B" , names(Sum_exp_each_GO_celltype[[1]]))) Input_dot_plot <- rbind(Input_dot_plot , temp) } Input_dot_plot <- Input_dot_plot[-1,] rownames(Input_dot_plot) <- names(genes_each_GO) * 22. Calculate the z-score across all samples and cell types for visualization ([115]Figure 2B). # For visualization, calculate z-score by row Input_dot_plot_zscore <- t(apply((Input_dot_plot[,1:length(Input_dot_plot)]), 1, function(x){ mean <- mean(x) SD <- sd(x) Z_score <- (x-mean)/SD Z_score })) Input_dot_plot_zscore <- as.data.frame(Input_dot_plot_zscore) # Convert to long format for plot forplot <- gather(Input_dot_plot_zscore %>% rownames_to_column("GO"), key = "ST" , value = "Expression" , -GO) forplot <- forplot %>% separate(col = "ST" , into = c("Cell_Types" , "Patients" , "time"),sep = " " , remove = F) forplot$time <- factor(forplot$time , levels = c("Day-2" , "Day-1" , "Def" , "Wk2" , "I" , "II")) forplot$Cell_Types <- factor(forplot$Cell_Types , levels = c("Monocytes" , "NK" , "T" , "B")) forplot$Patients <- factor(forplot$Patients , levels = c("DF" , "DHF" , "Healthy")) * 23. Visualize the results using the ggplot2 package.[116]^14 # Visualize the relative expression of HVGs over all samples and cell types ggplot(forplot, aes(x=time, y=GO , color= Expression , size = Expression)) + geom_point(alpha = 1.5) + theme_classic() + scale_colour_gradient2( low = "blue", mid = "white", high = "red", space = "Lab" , limits = c(-max(forplot$Expression),max(forplot$Expression)) ) + xlab("") + scale_size_continuous(range = c(5,5)) + facet_grid(∼Cell_Types+Patients , scales = "free", space = "free") + labs(color = paste("z-score")) + theme(axis.text.x = element_text(angle = 60, hjust=1), axis.text=element_text(size=20) , axis.title.y = element_text(size=25), strip.background = element_rect(colour = "white", fill = c("gray","darkorange3", "gray")), legend.text = element_text(size = 15), legend.title = element_text(size = 20),legend.key.size = unit(1, "cm")) + ylab("Significant GO terms") HVG identification and pathway enrichment analysis across early and late time points in immune cell types - COVID-19 case study Inline graphic Timing: ∼ 10–15 min In addition to the dengue data published in our earlier study,[117]^1 here we also demonstrate the application of this framework on another time-course scRNA-seq study. We retrieved processed scRNA-seq data from a COVID-19 study,[118]^10 which was deposited on FASTGenomics ([119]https://www.fastgenomics.org/). We selected two COVID-19 patients (“cohort 1”) with the donor IDs "C19-CB-0009″ and "C19-CB-0012″, whose samples were obtained at the same “days after symptom onset”; together with two healthy donors, "P15F″ and "P17H". We used the same cell type annotation as initially characterized by the authors of the study.[120]^10 Note: Using our framework to analyze this COVID-19 dataset, we observed distinct overall transcriptome profiles between early and late phases of the infection in each of the cell types of interest, while the differences between the two patients were relatively small ([121]Figure 3A). Moreover, the overall expression patterns of the two healthy donors were nicely grouped together and clearly separated from the COVID-19 patients at all the time points ([122]Figure 3A). We next identified HVGs and virtualized their dynamic expression patterns, for different groups of biological pathways that the HVGs are associated with ([123]Figure 3B). Interestingly, we observed clear up-regulation of the HVGs in the early infection associated with “response to type I interferon”, “interferon-mediated signaling pathway”, and “type I interferon-mediated signaling pathway” in monocytes ([124]Figure 3B, labeled in blue), which have also been described in the original paper,[125]^10 but using a different analytical framework.[126]^10 In addition to this, our analysis and visualization also revealed cell-type-specific expression dynamics of the HVGs associated with certain biological pathways, such as “natural killer cell mediated cytotoxicity” HVGs being specifically upregulated in NK cells; “T cell receptor V(D)J recombination” HVGs being expanded in CD4+ and CD8+ T cells, and several B cells-related pathways such as “B cell activation”, “humoral immune response”, and “B cell mediated immunity“ being up-regulated specifically in B cells at the early infection ([127]Figure 3B), all of which have not been mentioned in the original study.[128]^10 Figure 3. [129]Figure 3 [130]Open in a new tab The relative expression values of HVGs in each biological process (BP) across two COVID-19 patients during the early time point (d9; 9 days after symptom onset) and late time point (d16; 16 days after symptom onset), together with two healthy controls (A) PCA of average gene expression in each cell type. Different colors represent the days after symptom onset, while different shapes represent individual samples. (B) Dotplot showing the relative expression levels of HVGs related to each BP. The time-specific BPs described in the original paper[131]^10 are labeled in blue. The NK-specific BP is highlighted in purple, T-specific BP is in dark red, and B-specific BPs are in green. P1 = covid-19 patient 1; P2 = covid-19 patient 2; HC = healthy controls. d9 = 9 days after symptom onset; d16 = 16 days after symptom onset. Expected outcomes We propose an approach for identification of highly variable genes (HVGs) and investigation of their expression patterns across multiple time points (or biological conditions). Similarly to differentially expressed genes (DEGs) that are commonly used to represent the genes that might be activated or repressed differentially between two biological conditions, such as between patients vs.’ healthy controls, or treatment vs.’ normal controls, HVGs depict the genes that show the highest dynamics across multiple time points, and their functions might be linked to temporally specific biological processes over the course of the studies. The overall variations between multiple transcriptomes obtained in different biological conditions can be consolidated using Principal Component Analysis (PCA) and then visualized in a two-dimensional plot. With the scRNA-seq data, we can readily obtain and investigate the PCA plots of different cell types and subpopulations ([132]Figures 2A and [133]3A). HVGs of different cell types are also identified during this PCA process, and used as the inputs of pathway enrichment analysis. Relative expression patterns across time points of HVGs associated with biological processes of interest are then computed and visualized across different cell types in a single figure, as shown in [134]Figures 2B and [135]3B. As demonstrated using the dengue[136]^1 and COVID-19[137]^10 case studies, we can identify the biological processes whose HVGs demonstrate cell-type-specific expression dynamics across the time points, and hence suggest the links between time-point specific gene expression regulation and relevant biological processes in each of the cell types of interest. In addition to scRNA-seq, we expect that the framework will be adaptable for the investigation of temporal or longitudinal gene expression datasets obtained using other high-throughput and omic technologies, such as time-course bulk RNA-seq or proteomic studies. Limitations Batch effect can occur when the samples obtained using different technologies or platforms. In the protocol described here, all the samples were processed using the same versions of the 10x genomics single-cell kits. Hence, we did not observe technical batch effects. If multiple samples used in the analysis are generated by different techniques, it is advisable to inspect the batch effect by PCA before starting the HVG analysis. Library preparation techniques could appear as the factor most affecting the grouping of the samples. In such cases, it may not be possible to obtain HVGs that truly represent the most variances from biological conditions using our described protocols. It might be possible; however, to regress such batch effects using tools such as Harmony,[138]^28 Mutual Nearest Neighbors (MNN),[139]^29 and Seurat Canonical Correlation Analysis (CCA).[140]^30 Troubleshooting Problem 1 Can this protocol be used with a .h5ad input file for the HVG identification (related to step 9)? Potential solution A .h5ad file can be converted into a Seurat object, and the average gene expression can be calculated using the codes below. Convert a .h5ad file to a Seurat object. library(zellkonverter) sce <- readH5AD("PATH/TO/file.h5ad", verbose = TRUE) seurat_obj <- as.Seurat(sce, counts = "X", data = NULL) Calculate average gene expression levels in each cell type: Avg_expression_list <- list() for (RN in 1:length(each_celltype_list)) { Idents(each_celltype_list[[RN]]) <- "ST" Avg_expression_list[[RN]] <- AverageExpression(each_celltype_list[[RN]] , assays = "originalexp" , slot = "data") Avg_expression_list[[RN]] <- as.data.frame(Avg_expression_list[[RN]]$originalexp) } names(Avg_expression_list) <- names(each_celltype_list) Problem 2 The code described in step 17 is specific for the output generated from the gProfiler2 package.[141]^13 If different functional enrichment tools are used, the outputs might be different from the gProfiler2 one, which could result in an error when running step 17. Potential solution Users can alternatively create a list of GO terms themselves, as shown below. Each GO term in a list should contain genes that are related to that particular GO term. After creating this list, users should be able to continue with step 18. An example of a list, [142]graphic file with name fx4.jpg [143]Open in a new tab Problem 3 The “ambiguous” gene names are sometimes not recognized by the gProfiler2 package[144]^13 (related to step 13). Potential solution We encourage users to use the Ensembl IDs as default inputs. Problem 4 Large R, Rmd scripts, or R objects sometimes cannot be loaded and/or run on Rstudio (related to HVG identification and pathway enrichment analysis and investigating dynamic expression patterns of HVGs across all time points and cell types sections). Potential solution We encourage users to run the scripts using .R via R or RServer with less Graphic User Interface (GUI) instead. Resource availability Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead contact, Varodom Charoensawan (Email: [145]varodom.cha@mahidol.ac.th).[146]jantarikaarora@gmail.com, [147]anunya.opa@mahidol.edu Materials availability This study did not generate new unique materials. Acknowledgments