Graphical abstract [image: fx1.jpg]

Highlights

* Includes an intuitive graphical user interface for interactive analysis of scRNA-seq data
* Allows non-computational users to analyze scRNA-seq data with end-to-end workflows
* Provides interoperability between tools across different programming environments
* Produces HTML reports for reproducibility and easy sharing of results

The bigger picture

Single-cell data can be used to understand complex biological systems. However, many single-cell analysis tools can only be used by trained computational biologists and are scattered across different programming languages. The Single-Cell Toolkit (SCTK) is a software package that brings together many different tools in one place and allows non-computational users to analyze their own data using a graphical user interface. Overall, SCTK gives computational and non-computational researchers the ability to access a wide variety of single-cell tools to perform complex analysis workflows.

The Single Cell Toolkit (SCTK) is a software package that gives computational and non-computational researchers the ability to utilize a wide variety of tools and complex workflows for single-cell analysis.

Introduction

Single-cell RNA sequencing (scRNA-seq) is a molecular assay that can quantify the levels of mRNA transcripts for each gene in individual cells. This approach can be used to gain insights into cellular heterogeneity not previously possible with “bulk” transcriptomic assays.[50]^1^,[51]^2 Profiling the transcriptome of individual cells has revealed novel cell subpopulations in normal tissues and cell states associated with the pathogenesis of complex diseases.[52]^3 A large number of tools and software packages are available to perform different steps of scRNA-seq data analysis.
However, these tools are spread across different programming environments and rely on different data structures for input of data or output of results. As interoperability between tools across platforms is lacking, users generally have to choose a single analysis workflow or spend considerable effort manually converting data between environments running different tools and integrating results.[53]^4 Moreover, many researchers without strong computational backgrounds are generating scRNA-seq data but do not have the necessary training for analysis and interpretation. Currently, there are limited options for frameworks that allow for interoperability of tools across environments and contain a graphical user interface (GUI) for non-computational users to perform flexible end-to-end analysis.[54]^5^,[55]^6^,[56]^7^,[57]^8^,[58]^9 While some web applications are available for the analysis of scRNA-seq data, there are no online tools that can import data from a variety of formats, perform comprehensive quality control and filtering, run flexible clustering and trajectory workflows, and apply a series of downstream analysis and visualization tools within an interactive interface amenable to users without a strong programming background. We have previously developed the Single Cell Toolkit (SCTK), which is implemented in the R/Bioconductor package singleCellTK for comprehensive importing and quality control of scRNA-seq data.[59]^10 SCTK2 greatly expands our previous package[60]^10 with a variety of tools for normalization, integration, dimensionality reduction, clustering, trajectory analysis, cell-type labeling, pathway scoring, and visualization. SCTK2 facilitates interoperability between workflows and programming environments by integrating tools from Seurat, Bioconductor, and the Python-based Scanpy package.
All of the end-to-end analysis workflows are accessible using a “point-and-click” GUI powered by a Shiny app to enable users without programming skills to analyze their own data. The R/Shiny app is available online at [61]https://sctk.bu.edu and gives users the ability to analyze their own data without access to substantial computational resources. When compared with existing tools, the SCTK2 framework offers more options for analysis, interactive visualization, and generation of HTML reports for reproducibility.

Results

Overview of the general framework

singleCellTK (SCTK) is an R package that provides a uniform interface to popular scRNA-seq tools and workflows for quality control, clustering or trajectory analysis, and visualization. SCTK2 gives users the opportunity to seamlessly run different tools from different packages and environments during different stages of the analysis. Tools can be run by computational users in the R[62]^11 console, by non-computational users with an interactive GUI developed in R/Shiny,[63]^12 or with HTML reports generated with Rmarkdown. SCTK2 utilizes multiple Bioconductor Experiment objects such as the SingleCellExperiment (SCE) as the primary data container for storing expression matrices, reduced dimensional representations, cell and feature annotations, and other tool outputs.[64]^13^,[65]^14

Flexible and comprehensive workflows for scRNA-seq analysis

The major steps of the SCTK2 workflow can be divided into three major components: (1) importing, quality control, and filtering, (2) normalization, dimensionality reduction, and clustering, and (3) various downstream analyses and visualizations for exploring biological patterns of the cell clusters ([66]Figure 1). For the first component, we have included the ability to import data from 11 different preprocessing tools or file formats including Seurat objects or SCE objects from R and AnnData objects from Python.
SCTK2 generates standard quality control (QC) metrics such as the total number of counts, the features detected per cell, or the mitochondrial percentage using the scater R package.[67]^15 Doublet detection can be performed with 3 different tools from R and 1 tool from Python. Ambient RNA quantification and removal can be performed with DecontX[68]^16 or SoupX.[69]^17 For filtering, users can choose to exclude cells or genes based on one or a combination of QC metrics produced by the various QC tools.

Figure 1. Overview of analysis workflows available in SCTK2. Analysis of scRNA-seq data can be divided into three major parts: importing and quality control (QC), clustering workflows, and downstream analysis. For importing and QC (top), SCTK2 can import data from many different upstream preprocessing tools and formats. A variety of metrics for general QC, empty drop detection, doublet detection, and ambient RNA quantification can be calculated and displayed for each sample. For clustering workflows (middle), SCTK2 provides an à la carte workflow that allows users to pick and choose different tools at each step for normalization, batch correction or integration, dimensionality reduction, and clustering. For downstream analysis (bottom), SCTK2 provides access to additional tools and analyses for differential expression, cell-type labeling, pathway analysis, and trajectory analysis. Overall, the toolkit provides a wide variety of methods for each part of the analysis workflow.

The major steps for the clustering workflows include normalization, selection of highly variable genes (HVGs), dimensionality reduction such as principal-component analysis (PCA), clustering, and 2D embedding such as uniform manifold approximation and projection (UMAP). SCTK2 provides an “à la carte” workflow, which allows users to pick and choose different tools at each step ([72]Figure 1).
Users also have the option to perform batch correction or integration after normalization with 8 tools including 6 from R and 2 from Python. SCTK2 also provides access to several “curated workflows” that allow users to select from specific tools or functions in predefined workflows from other packages ([73]Figure 2). Curated workflows include those from the Seurat[74]^18^,[75]^19^,[76]^20^,[77]^21 and celda[78]^22 packages in R and the scanpy[79]^23 package in Python. All three curated workflows can be used to cluster cells and produce 2D embeddings, whereas celda can also be used to cluster genes into co-expression modules.

Figure 2. Overview of curated analysis workflows. In addition to the à la carte clustering workflow, SCTK2 provides access to workflows from the R packages Seurat and celda as well as the Python package scanpy. Users can recapitulate the analysis, results, and plots from each package all while using the common and unified interface in SCTK2 without having to know the underlying commands from each package. Functions for normalization, variable feature selection, dimensionality reduction, and clustering are available from the Seurat and scanpy workflows. Celda can be used to group cells into clusters and genes into modules.

Downstream analyses after clustering include finding markers for cell clusters using differential expression (DE); DE analysis between user-specified conditions; automated cell-type labeling with SingleR[82]^24; pathway enrichment analysis with gene set variation analysis (GSVA),[83]^25 variance-adjusted Mahalanobis (VAM),[84]^26 or Enrichr[85]^27^,[86]^28; and trajectory analysis with TSCAN[87]^29 ([88]Figure 1). All of these analyses can be applied after the à la carte or curated workflows. DE analysis can be performed with the Wilcoxon rank-sum test, MAST,[89]^30 limma,[90]^31 ANOVA, or DESeq2[91]^32 and visualized with heatmaps or volcano plots.
The expression of individual genes can be displayed on 2D embeddings, violin plots, or boxplots. Finally, results from SCTK2 can be exported as flat text files (e.g., MTX, TXT, CSV), an SCE object, a Seurat object, or an AnnData[92]^23^,[93]^33 object to allow for further analysis and integration with other tools.

Interactive analysis with the SCTK2 GUI

Users without a strong programming background can analyze scRNA-seq data with the interactive GUI built with Shiny and available at [94]https://sctk.bu.edu ([95]Figure 3A). The major steps in the à la carte analysis are accessible via the menus in the top navigation bar. Within each major section, parameters to run tools can be selected in the left panel, and results will be displayed in the right panel. Many plots can be customized with additional options such as the choice of the embedding in a scatterplot or choosing to color the points by a particular metric or label. A “next steps” panel provides a “wizard”-like guide by suggesting links to the most common next steps. The curated workflows can be run using a “vertical tab” walkthrough format ([96]Figure 3B). SCTK2 also has a general visualization tab called the “Cell Viewer,” which supports functionality to generate and visualize custom scatterplots, bar plots, and violin plots for user-selected genes or gene sets. Additionally, a generic heatmap plotting tab can be used to visualize the expression levels of multiple features from an expression matrix along with a variety of cell or feature annotations. The majority of plots are made interactive with the plotly[97]^34 package and can be highlighted, cropped, zoomed in on, and saved in various formats.

Figure 3. Interactive analysis of scRNA-seq data with a graphical user interface (GUI). SCTK2 allows non-computational users to analyze scRNA-seq data using an interactive GUI built with R/Shiny that can be hosted on a web server.
(A) (1) The menu bar allows the users to navigate through the main sections including data importing, QC, the à la carte clustering workflow, and downstream analysis. (2) Within each major section, parameters to run tools can be selected in the left panel. (3) Results and plots will be displayed in the right panel. (4) Many plots can be customized with additional options such as changing the color of points to reflect different phenotypes. (5) A “next steps” panel provides a “wizard”-like guide by suggesting links to the recommended next steps.
(B) The curated workflows for Seurat, celda, and scanpy can be used to run a series of predefined steps using vertical tabs. (1) Curated workflows can be selected from the top navigation menu bar. The Seurat curated workflow is shown as an example. (2) Steps for normalization, feature selection, dimensionality reduction, clustering, 2D embedding, and finding markers can be selected and run using the vertical tabs. (3) Within each major section, parameters to run tools can be selected in the left panel, and (4) results and plots will be displayed in the right panel. (5) Within the Seurat curated workflow, an extra section is given for exploring expression of features using UMAPs, heatmaps, and violin plots.

Reproducible and sharable analysis with HTML reports

SCTK2 can generate HTML reports for QC tools, DE results, differential abundance (DA) results, and the curated workflows. These reporting tools can be used to plot and share a previously run analysis or start a new analysis workflow de novo with user-specified parameters. The output of these functions is a comprehensive HTML report that describes the input data, run parameters, and results with the standard visualizations. These reports provide reproducibility and offer a quick and easy way to explore and share the results of an individual analysis or whole workflow.
For example, the “Seurat run” and “Seurat results” reports allow users to recapitulate the entire Seurat curated workflow ([100]Figure 4). Users can select and review each step of the workflow using the content menu on the left. Each section contains a description for that step of the workflow along with the plot or results that were produced by that step. Code used to produce the plots can also be viewed. For example, the “clustering” section shows different choices of the “resolution” parameter in different tabs to allow users to explore different sets of cluster labels. The “Seurat results” section allows users to view heatmaps and UMAPs of the marker genes for each cluster. These reports can be generated with a single command and thus streamline the process of generating sharable figures and tables along with descriptions.

Figure 4. Facilitating reproducibility and sharing of results with HTML reports. SCTK2 provides the ability to generate HTML reports for several individual analyses or entire workflows to enable reproducibility and facilitate sharing of results. An HTML report for clustering of PBMC data with Seurat is shown as an example. (1) Different steps that were run in the workflow can be selected with the content menu on the left of the report. (2) In each section, a description of the step or tool and the selected parameters are shown at the top, and (3) the code used to produce the plot can be expanded. (4) The results and plots are shown on the right side. The “clustering” section shows different choices of the “resolution” parameter in different tabs to allow users to easily explore different sets of cluster labels.

Benchmarking

We benchmarked the ability of SCTK2 to analyze four datasets of different sizes. Two datasets of peripheral blood mononuclear cells (PBMCs) were obtained from 10× Genomics that contained 5,419 (pbmc6k) and 68,579 cells (pbmc68k).
Two more datasets of immune cells were obtained from the “1M Immune Cells” project from the Human Cell Atlas that contained 100,000 (immune100k) and 300,000 cells (immune300k). The workflow consisted of the following steps: importing data from sparse matrix files, generating QC metrics, filtering, normalization, variable feature selection, dimension reduction, 2D embedding, clustering, and marker detection. We recorded the RAM usage for the SCE object after each step ([103]Figure 5A), the peak RAM allocation that was used during each step ([104]Figure 5B), and the time elapsed during each step ([105]Figure 5C). The largest RAM usage for the SCE object was 6.23 GB and occurred after the marker detection step for the largest dataset. The largest peak RAM usage was 16.65 GB and occurred during the importing step of the largest dataset. The longest time elapsed was 80.46 min and occurred during the marker detection step for the largest dataset. These results demonstrate that the SCTK2 GUI deployed on a server with typical memory availability (e.g., 64 GB) can be used to analyze many standard single-cell datasets for several users at a time.

Figure 5. Benchmarking of RAM and CPU usage for datasets of different sizes. RAM allocation and elapsed time were benchmarked for four datasets (pbmc6k, pbmc68k, immune100k, and immune300k) using a Bioconductor-based analysis workflow. (A) The RAM usage for the output SCE object after each step is shown for each dataset. (B) The peak RAM usage during each step is displayed for each dataset. (C) The time elapsed during each step is displayed for each dataset. The left part zooms in on the y axis of the right part.

Comparison with other tools with GUI for scRNA-seq analysis

Some other tools and packages are available that provide a GUI for scRNA-seq data analysis.
We compared the availability of supported methods between SCTK2 and Pegasus,[108]^5 ASAP,[109]^6^,[110]^7 BingleSeq,[111]^8 and CReSCENT[112]^9 ([113]Table S1). Generally, SCTK2 supports more methods and options for the various stages of a typical scRNA-seq analysis. In particular, SCTK2 has more options for importing from different data sources and supports more QC algorithms. Similar to SCTK2, several methods and workflows are available in Pegasus. However, the GUI in Pegasus is only available via Jupyter Notebooks in the Terra Cloud platform, and non-computational users need to have access to a Cloud account and a Terra workspace before they can fully utilize this tool. Options for ASAP that are not in SCTK2 include voom and DESeq2 for normalization, M3Drop for variable feature detection, and Seurat Leiden, hierarchical, and SC3 methods for clustering. BingleSeq has Monocle for trajectory analysis and dot plots for visualization. Lastly, CReSCENT has dot plots for visualization. With respect to trajectory analyses, SCTK2 uses TSCAN, while Pegasus supports diffusion maps and BingleSeq includes Monocle.

Discussion

SCTK2 provides an intuitive and easy-to-use GUI that integrates a variety of widely used methods into a single end-to-end workflow. Instead of having to switch between different graphic-based tools or learning a programming language to run a method that utilizes specific data structures, users can use the “point-and-click” GUI to access existing analysis methods for scRNA-seq data. Features available in the GUI include the ability to import scRNA-seq data from a variety of formats; import and edit annotations for genes and cells; run QC analysis and apply filters; and perform normalization, dimensionality reduction, clustering, DE, pathway analysis, trajectory analysis, and interactive visualization. The ability to easily generate comprehensive HTML reports enables quick sharing between collaborators and reproducibility of results.
As a large number of tools have been developed for scRNA-seq analysis, we prioritized tools that have been widely used and have stable code bases in standard repositories. We also performed benchmarking using datasets with different numbers of cells to demonstrate that our platform can analyze small, medium, and large datasets in a reasonable amount of time. In the future, the singleCellTK package will be updated to utilize the MultiAssayExperiment and ExperimentSubset packages to store and manipulate both multimodal data and subsets of existing datasets using a single underlying object and from the same interactive interface. Overall, these features make SCTK2 a convenient toolkit for the analysis of scRNA-seq data regardless of a person’s programming background.

Experimental procedures

Comprehensive importing

SCTK2 enables importing data from the following preprocessing tools: CellRanger,[114]^35 Optimus, DropEst,[115]^36 BUStools,[116]^37^,[117]^38 Seqc,[118]^39 STARSolo,[119]^40^,[120]^41 and Alevin.[121]^42^,[122]^43 In all cases, SCTK2 parses the standard output directory structure from the preprocessing tools and automatically identifies the count files to import. These functions also support importing of count matrices stored in plain text files (e.g., MTX, CSV, and TSV formats), of SCE objects saved in an RDS file, and of AnnData objects saved in an h5ad file. The Shiny GUI allows users to specify the location of files for multiple samples on their local device. The data for these samples are uploaded and combined into a single SCE object to use across analyses.

QC and filtering

Performing comprehensive QC is necessary to remove poor-quality cells for downstream analysis of scRNA-seq data. Within droplet-based scRNA-seq data, droplets containing cells must be differentiated from empty droplets. Therefore, assessment of the data is required, for which various QC algorithms have been developed.
In SCTK2, we support the EmptyDrops[123]^44 and BarcodeRank[124]^45 tools in R for detection of empty droplets. General QC metrics include the total number of counts, the number of features detected, and the percentage of mitochondrial reads. Tools for doublet detection include Scrublet[125]^46 from Python as well as scDblFinder[126]^47; cxds, bcds, and a hybrid of cxds and bcds from scds[127]^48; and doubletFinder[128]^49 from R. Tools for detection and removal of ambient RNA include decontX[129]^16 and SoupX.[130]^17 The metrics computed from these algorithms can be visualized on a 2D embedding or violin plot. Based on these metrics, users can filter the cells by selecting an appropriate metric and a cutoff value. The filtered data are stored in a separate SCE object and can be utilized in all subsequent analyses.

À la carte analysis workflow

The à la carte analysis workflow includes the main interface and the functions of the toolkit that let users pick and choose different methods and options for various steps of the analysis workflow including normalization, batch correction or integration, feature selection, dimensionality reduction, 2D embedding, and clustering.

Normalization

SCTK2 offers a convenient way to normalize data for downstream analysis using a number of methods available through the toolkit. Normalization methods available with the toolkit include “LogNormalize,” “CLR,” “RC,” and “SCTransform” from the Seurat R package, and “logNormCounts” and “CPM” from the scater R package. Additional transformation options are available for users including “log,” “log1p,” trimming of data assays, and Z score scaling.
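
As a rough illustration of what a library-size log-normalization followed by Z score scaling does to a count matrix, the following is a minimal, self-contained Python sketch. It is not the Seurat or scater implementation: the function names `log_normalize` and `zscore_genes`, the cells-by-genes list layout, and the default scale factor of 10,000 are assumptions made for this example.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Divide each cell by its total count, rescale, and apply log1p.

    counts: list of cells, each a list of raw gene counts (cells x genes).
    scale_factor: assumed default for this sketch.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # A cell with no counts stays all-zero instead of dividing by zero.
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log1p(c / total * scale_factor) for c in cell])
    return normalized


def zscore_genes(matrix):
    """Center and scale each gene (column) to mean 0 and unit sample SD."""
    n_cells = len(matrix)
    n_genes = len(matrix[0])
    scaled = [[0.0] * n_genes for _ in range(n_cells)]
    for g in range(n_genes):
        column = [row[g] for row in matrix]
        mean = sum(column) / n_cells
        if n_cells > 1:
            var = sum((v - mean) ** 2 for v in column) / (n_cells - 1)
        else:
            var = 0.0
        sd = math.sqrt(var)
        for i in range(n_cells):
            # Genes with zero variance are left at 0.
            scaled[i][g] = (column[i] - mean) / sd if sd > 0 else 0.0
    return scaled


if __name__ == "__main__":
    raw = [[1, 1, 2], [0, 0, 0], [5, 0, 5]]
    logged = log_normalize(raw)
    print(zscore_genes(logged))
```

Dividing by the per-cell total removes differences in sequencing depth before the log transform, and the Z score step puts genes on a comparable scale for methods such as PCA.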

Batch correction and integration

SCTK2 provides access to methods for batch correction and integration of samples from R packages including Batchelor (MNN),[131]^50 ComBat (sva),[132]^51^,[133]^52 limma,[134]^31 scMerge,[135]^53 Seurat, and ZINBWaVE,[136]^54 as well as Python packages including BBKNN[137]^55 and Scanorama.[138]^56 These methods accept various types of input expression matrices (e.g., raw counts or log-normalized counts) and generate either a new corrected expression matrix or a low-dimensional representation of the integrated data.

Feature selection

Several methods are available to compute and select the most variable features to use in the downstream analysis. Feature selection methods available with the toolkit include “vst,” “mean.var.plot,” and “dispersion” from the Seurat R package and “modelGeneVar” from the scran R package.[139]^57 The top variable genes can be visualized through the toolkit in a scatterplot of the genes or features using the mean-to-variance or mean-to-dispersion plot depending upon the algorithm used.

Dimensionality reduction and 2D embedding

The toolkit provides access to both PCA and ICA (independent-component analysis) algorithms from multiple packages for reducing the expression matrices into reduced dimensions. PCA is implemented from both the scater and Seurat R packages, while an implementation of ICA is only available from Seurat. Reduced dimensions computed from these methods can be visualized through various plots including component plot, elbow plot, jackstraw plot, and heatmaps. 2D embedding methods available with the toolkit include “tSNE” and “UMAP” from the Seurat package, “tSNE” from the Rtsne package, and “UMAP” from the scater package. The results computed from these methods can also be visualized using a 2D scatterplot.
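
To make the idea of reducing an expression matrix to a few components more concrete, here is a minimal Python sketch that finds the first principal component by power iteration on the centered data. This is a teaching example under stated assumptions, not the algorithm used by scater or Seurat: the function names, the cells-by-genes list layout, and the fixed iteration count are all hypothetical choices, and the sign of the returned component is arbitrary.

```python
import math


def center_columns(matrix):
    """Subtract the per-gene (column) mean from every cell (row)."""
    n_cells = len(matrix)
    n_genes = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_cells for j in range(n_genes)]
    return [[row[j] - means[j] for j in range(n_genes)] for row in matrix]


def first_principal_component(matrix, n_iter=200):
    """Return (loadings, scores) of the first PC via power iteration.

    matrix: list of cells, each a list of expression values (cells x genes).
    """
    x = center_columns(matrix)
    n_cells, n_genes = len(x), len(x[0])
    # Deterministic, non-uniform start vector; it must not be orthogonal
    # to the leading component for the iteration to converge to it.
    v = [1.0 / (j + 1) for j in range(n_genes)]
    norm = math.sqrt(sum(a * a for a in v))
    v = [a / norm for a in v]
    for _ in range(n_iter):
        # scores = X v, then w = X^T scores, i.e. w is proportional to Cov * v.
        scores = [sum(row[j] * v[j] for j in range(n_genes)) for row in x]
        w = [sum(x[i][j] * scores[i] for i in range(n_cells)) for j in range(n_genes)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break  # no variance left along the current direction
        v = [a / norm for a in w]
    scores = [sum(row[j] * v[j] for j in range(n_genes)) for row in x]
    return v, scores


if __name__ == "__main__":
    # Four cells whose two genes vary together along the direction (1, 2).
    loadings, scores = first_principal_component([[0, 0], [1, 2], [2, 4], [3, 6]])
    print(loadings, scores)
```

The per-cell scores are the coordinates that would be plotted on a component plot, and the loadings indicate how strongly each gene contributes to that component.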

Clustering

Graph-based clustering methods available within SCTK2 include “Walktrap,”[140]^58 “Louvain,”[141]^59 “infomap,”[142]^60 “fastGreedy,”[143]^61 and “labelProp”[144]^62 from the scran R package or “Louvain,” “multilevel,”[145]^63 or “SLM”[146]^64 from the Seurat R package. Additionally, K-means methods can be run using “Hartigan-Wong,” “Lloyd,” or “MacQueen” algorithms from the stats R package.

Curated workflows

SCTK2 provides access to Seurat, Scanpy, and Celda analysis workflows through a streamlined and guided interface. Seurat is a widely used R package that implements various methods for processing and clustering of scRNA-seq data. Similarly, Scanpy is a Python package that also provides methods for analyzing scRNA-seq data. Celda is an R package that performs co-clustering of genes into modules and cells into subpopulations. In the SCTK2 GUI, all the steps of the Seurat, Scanpy, and Celda workflows can be run in a “step-by-step” fashion with the “vertical” layout. These curated workflows allow new or beginner users to quickly run an exploratory analysis of single-cell data without having to try too many combinations of parameters or tools.

DE and marker selection

The toolkit offers DE in a group-vs.-group way using one of the five implemented methods including Wilcoxon rank-sum test, MAST, limma, DESeq2, or ANOVA. Alternatively, users can also use the DE methods in a “find marker” analysis to identify the top marker genes for each group of cells against all the other cells. The results for both approaches can be viewed through tables that display the top differentially expressed genes or marker genes along with the metrics computed by the selected method.

Cell-type labeling

Cell-type labeling from a reference can be performed with the SingleR package in R. SingleR works by comparing the expression profile of each single cell to an annotated reference dataset and labels each cell with a cell type of the highest likelihood.
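
The reference-based labeling idea described above can be sketched in a few lines of Python. This is a deliberately simplified illustration rather than SingleR itself: it scores each cell against one averaged profile per cell type using Spearman correlation and picks the best match, and it omits steps such as marker gene selection and fine-tuning. The function names and the toy reference profiles are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def label_cells(cells, reference):
    """Assign each cell the reference label with the highest correlation.

    cells: list of expression profiles (same gene order as the reference).
    reference: dict mapping cell-type name -> reference expression profile.
    """
    return [
        max(reference, key=lambda name: spearman(profile, reference[name]))
        for profile in cells
    ]


if __name__ == "__main__":
    # Toy reference with four genes; values are made up for illustration.
    reference = {"T cell": [9, 8, 1, 0], "B cell": [0, 1, 9, 8]}
    cells = [[5, 4, 0, 1], [1, 0, 7, 6]]
    print(label_cells(cells, reference))
```

Rank-based correlation is a convenient choice in this setting because it is insensitive to differences in scale between the query data and the reference.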
SingleR can also label clusters of cells instead of individual cells. The cell-type assignments of clusters or individual cells can be visualized on a 2D embedding in the same fashion as labels from de novo clustering algorithms.

Pathway analysis

Custom gene sets can be imported by the users or automatically downloaded from MSigDB[147]^65 using the R package msigdb. Methods for scoring the levels of a gene set in each individual cell include VAM[148]^26 and GSVA in R.[149]^25 The scores for gene sets can be used in a DE analysis to compare different cell annotations such as cell type or experimental condition. The distribution of gene set scores can be visualized using violin plots. The EnrichR R package[150]^27^,[151]^28 can be used to determine if sets of genes are enriched for biological pathways in curated databases such as KEGG,[152]^66 GO,[153]^67 and MSigDB.

Trajectory analysis

Cell trajectories can be constructed by building a cluster-based minimum spanning tree (MST) and estimating pseudotime on the paths with the TSCAN R package.[154]^29 Based on the trajectory, SCTK2 also provides TSCAN methods to test features that are differentially expressed on a path or between paths. The pseudotime value or the expression of DE features can be visualized on a 2D embedding with the MST projected and overlaid on it.

Benchmarking

The pbmc6k and pbmc68k datasets were obtained using the importExampleData() function, which utilized the TENxPBMCData package (v.1.12.0) and the ExperimentHub package (v.2.2.1) to retrieve the data. The immune100k and immune300k datasets were retrieved and downsampled from the Human Cell Atlas Portal. All datasets were exported to MTX format.
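
For readers unfamiliar with the MTX (Matrix Market coordinate) format mentioned here, the following minimal Python sketch parses a small sparse count matrix and computes per-cell totals. It is only an illustration and not the importer used by SCTK2: the function names and the example matrix are hypothetical, and the sketch assumes integer counts with features as rows and cells as columns, as is conventional for droplet-based count matrices.

```python
def parse_mtx(text):
    """Parse Matrix Market coordinate text into (n_rows, n_cols, entries).

    entries maps 0-based (row, column) pairs to integer counts.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("%%MatrixMarket"):
        raise ValueError("missing MatrixMarket header line")
    # Drop blank lines and '%' comment lines after the header.
    body = [ln for ln in lines[1:] if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(tok) for tok in body[0].split())
    entries = {}
    for ln in body[1:]:
        row, col, value = ln.split()
        # The file uses 1-based indices; convert to 0-based.
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


def counts_per_cell(n_cols, entries):
    """Sum counts over features for each cell (column)."""
    totals = [0] * n_cols
    for (_, col), value in entries.items():
        totals[col] += value
    return totals


# Made-up example: 3 features x 2 cells with 4 non-zero entries.
EXAMPLE_MTX = "\n".join([
    "%%MatrixMarket matrix coordinate integer general",
    "% illustrative data only",
    "3 2 4",
    "1 1 5",
    "3 1 2",
    "2 2 7",
    "3 2 1",
])

if __name__ == "__main__":
    n_rows, n_cols, entries = parse_mtx(EXAMPLE_MTX)
    print(n_rows, n_cols, counts_per_cell(n_cols, entries))
```

Because only non-zero entries are stored, this format keeps large and mostly empty single-cell count matrices compact on disk.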
The workflow that was benchmarked included steps for (1) importing the data from an MTX file using the importFromFiles() function; (2) calculation of general QC metrics using the runPerCellQC() function; (3) normalization using the runNormalization() function with the “logNormCounts” method; (4) calculation of variable features using the runFeatureSelection() function with the “modelGeneVar” method; (5) dimensionality reduction using the runDimReduce() function with the “scaterPCA” method; (6) UMAP embedding using the runDimReduce() function with the “scaterUMAP” method; (7) clustering using the runScranSNN() function with the “Louvain” method; and (8) a differential gene expression analysis using the runFindMarker() function with the “Wilcox” method. For each of the steps, we used the peakRAM() function from the peakRAM package (v.1.0.2) to record the RAM used by the SCE object after the completion of each step and the peak RAM allocation used during each step, as well as the time elapsed for each step. All the analyses were performed on an x86_64 Linux cluster node, configured with 404 GB RAM and an Intel Xeon 2.80 GHz CPU with 32 cores.

Data analysis

All figures in the results were generated by analyzing the PBMC and immune datasets with the singleCellTK package v.2.8.1. Code to reproduce the analysis can be found at [155]https://github.com/campbio-manuscripts/SCTK2.

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Joshua D. Campbell ([156]camp@bu.edu).

Materials availability

This study did not generate new unique materials.

Acknowledgments