Abstract

   Single-cell transcriptomics profiling has increasingly been used to
   evaluate cross-group (or condition) differences in cell population and
   cell-type gene expression. This often leads to large datasets with
   complex experimental designs that need advanced comparative analysis.
   Concurrently, bioinformatics software and analytic approaches also
   become more diverse and constantly undergo improvement. Thus, there is
   an increased need for automated and standardized data processing and
   analysis pipelines, which should be efficient and flexible too. To
   address these, we develop the single-cell Differential Analysis and
   Processing Pipeline (scDAPP), a R-based workflow for comparative
   analysis of single cell (or nucleus) transcriptomic data between two or
   more groups and at the levels of single cells or ‘pseudobulking’
   samples. The pipeline automates many steps of pre-processing using
   data-learnt parameters, uses previously benchmarked software, and
   generates comprehensive intermediate data and final results that are
   valuable for both beginners and experts of scRNA-seq analysis.
   Moreover, the analytic reports, augmented by extensive data
   visualization, increase the transparency of computational analysis and
   parameter choices, while facilitate users to go seamlessly from raw
   data to biological interpretation. scDAPP is freely available under the
   MIT license, with source code, documentation and sample data at the
   GitHub ([35]https://github.com/bioinfoDZ/scDAPP).

Introduction

   Advancements in single-cell transcriptomics technologies and reductions
   in cost have greatly increased the scale and complexity of experiments
   using single-cell and single-nucleus RNA-seq (scRNA-seq and snRNA-seq)
   ([36]1). In 2023, nearly 4000 papers were published using scRNA-seq,
   illustrating an exponential increase over the past 5 years (PubMed).
   Not only have these methods been used for routine categorization of
   cell population in tissues or biological samples, but also increasingly
   as read-outs of population and gene program changes across experimental
   conditions, such as genetic knockouts or drug treatments. As the
   technologies for data acquisition become more sophisticated, the
   bottleneck of innovation has shifted to efficient and rigorous
   bioinformatic analysis, to guide investigators to assess data quality
   rapidly and use benchmarked software for uncovering biological signals
   efficiently and robustly. Standard, scalable and modular workflows
   would greatly facilitate this process.

   Bioinformatic algorithms and software for single cell data analysis
   have also evolved rapidly and become challenging for ordinary
   researchers to follow. Various benchmark comparisons of software,
   however, have provided excellent recommendations for selecting methods
   for most data analysis steps, such as statistically rigorous strategies
   for cross-experimental-condition comparisons, including differential
   expression analysis and differential cell composition analysis
   ([37]2–5). Such studies have shown that the most important criterion
   for robust methodological performance is the capacity to explicitly
   model biological replicate variability via ‘pseudobulking’, which is to
   aggregate information (e.g. counts) for cells of the same type in each
   replicate. The alternative of treating data of individual cells as
   independent measurements could easily lead to inflated and
   inappropriate statistics, due to inherent correlation among cells in
   the same sample and large cell numbers, resulting in small effect-size
   discovery that is less biologically relevant and prone to noises.
   Related to this, advancements in sample multiplexing, such as Cell
   Hashing, Multi-Seq or 10X Cell Multiplexing, have significantly
   increased the number of replicates (at reduced cost) and complexity of
   data analysis ([38]6,[39]7). Additionally, we demonstrated previously
   that the Reference Principal Component Integration (‘RPCI’) algorithm,
   released in the Robust Integration of scRNA-seq data (‘RISC’) software
   package, could integrate multiple-sample data with high accuracy, while
   avoiding over-correction ([40]8).

   We present here an R-based pipeline called single-cell Differential
   Analysis and Processing Pipeline (scDAPP) for cross-group comparative
   analysis of scRNA-seq and snRNA-seq data from 10X Genomics platform.
   Our design emphasizes scalability, ease of use, user-friendly graphic
   visualization, transparency, and reproducibility. Compared to similar
   pipelines, including Cellsnake, Cellenics, scDrake, SingleCAnalyzer,
   and the Single-Cell Omics workbench ([41]9–13), scDAPP focuses more on
   comparative analysis across groups or conditions, with implementations
   specifically for using replicates. Furthermore, it supports complex
   multi-group comparisons (such as Drug A versus Drug B versus Control)
   and comprehensive downstream bioinformatics analysis, such as Gene Set
   Enrichment Analysis (GSEA) and transcription factor target analysis.
   Additionally, flexible input formats allow for direct data importing
   from raw CellRanger outputs or pre-processed scRNA-seq data objects.
   Importantly, using enriched visualization, scDAPP is designed to guide
   users to examine and understand the selection of parameters in each
   step of the data analysis, so that they can have a good grasp of the
   options and make appropriate and rational adjustments. The enriched
   visualization is especially valuable because it shows the underlying
   data distributions and moreover how parameter choices affect the
   analytic results. Overall, scDAPP facilitates and systematizes data
   processing steps and allows users to quickly delve into biological
   interpretation, thus advancing the rigor and scalability of single-cell
   transcriptomic data analysis.

Materials and methods

Overview of scDAPP core functions

   The scDAPP wraps previously published software packages into R markdown
   codes, with critical pipeline-specific implementation (Figure [42]1A).
   It starts with several cell filtering functions using parameters learnt
   from input data's distribution, including doublet removal, followed by
   individual sample analysis, integration of samples, clustering of the
   integrated data, and cross-group comparisons of cell cluster abundance
   and gene expression using either cell-level or pseudobulk-level data
   (when replicates are used). The core packages include Seurat (v5 and
   up), RISC/RPCI, Speckle, EdgeR, DESeq2, FGSEA and ClusterProfiler for
   various analyses ([43]8,[44]14–17), with critical function extensions
   or modifications specific for scDAPP. Label transfer from existing
   cell-type annotation is an optional feature to facilitate cell cluster
   identification.

Figure 1.

   [45]Figure 1.
   [46]Open in a new tab

   Overview of steps, inputs and outputs of scDAPP. (A) Overview of scDAPP
   steps and critical options. (B) Inputs of scDAPP. (C) Example
   illustrating configuration of inputs, matching gene expression data
   with metadata. (D) Tree structure of scDAPP output files.

scDAPP run configuration and input data

   Key input options and formats for the pipeline are shown in Figure
   [47]1B and [48]C. The input for scDAPP is the raw (non-normalized)
   Unique Molecular Identifier (UMI) counts matrix for each sample. This
   can be either the filtered matrices from CellRanger
   (‘filtered_feature_bc_matrix.h5’ files) or Seurat objects. The latter
   allows for flexibility, for example, to use data that have been
   filtered or processed by other means. In addition, pipeline run
   configuration files are needed, which specify sample-wise metadata
   including per-sample information and optional sample nickname codes
   (‘sample_metadata’), and a list of the relevant cross-group comparisons
   (‘comps’) (Figure [49]1B).

   One critical input parameter is ‘Pseudobulk_mode’ (TRUE/FALSE) that
   will specify whether to make cross-group comparisons using samples in
   the same group/condition as replicates (Figure [50]1A). Setting it to
   ‘TRUE’ is highly recommended when biological replicates are available.
   This has two effects: (i) Differential gene expression analysis will be
   run in a pseudobulk manner, invoking the EdgeR-Likelihood Ratio Test
   (or DESeq2) on the aggregated reads from all cells in each cluster in
   each replicate (2). (ii) Differential cell composition analysis will
   also be run in a replicate-aware manner via the ‘Propeller’ test from
   the ‘speckle’ package, which was demonstrated to perform more
   accurately (3). Conversely, setting the ‘Pseudobulk_mode’ parameter to
   ‘FALSE’ will evoke scDAPP to perform differential gene expression
   analysis by the Wilcoxon or other tests in Seurat at the single cell
   level (i.e. treating each cell as an independent data point), and
   differential cell abundance analysis by the two-proportion Z test. This
   non-replicate option is only recommended for comparative analysis
   without replicates but can be run on inputs with replicates. It is more
   prone to false positive, but this is sometimes unavoidable, such as in
   the context of pilot studies.

   Two other key options are provided (Figure [51]1A). The first is
   ‘use_labeltransfer’ for invoking the use of the Seurat label transfer
   workflow. If set to ‘TRUE’, two additional parameters need to be
   provided: one refers to a Seurat object containing normalized data from
   the Seurat's SingleCellTransform workflow and a metadata column called
   ‘Celltype’ to be used for label transfer, and the other (‘m_reference’)
   points to a table file listing marker genes for the reference cell
   types in the output format of the Seurat ‘FindAllMarkers’ function. The
   second is the ‘risc_reference’ parameter, specifically related to the
   RISC software. The RPCI algorithm uses one of the input samples to
   learn the principal component space to project cells in all samples.
   Users can specify which sample to be the RPCI reference, after they
   examine the clustering results from all samples. Alternatively, and by
   default, scDAPP provides an automated RISC reference selection
   algorithm (described below).

   The pipeline input has additional fields related to quality control
   metric thresholds, tuning of data-driven cell filtering, and
   hyperparameter selection in clustering analysis, with reasonable
   defaults, as described below. One more required input is ‘species’,
   which is used to search the MSIGDB database for the correct gene
   symbols during pathway enrichment analysis, via the msigdbr package
   ([52]18).

Description of step-by-step components

QC and cell filtering

   As shown in Figure [53]1A, the first step in scDAPP is quality control
   (QC) and filtering of poor-quality cells. On top of pre-set relaxed
   thresholds, scDAPP tries to learn better cutoffs from the input data
   directly. Currently scDAPP considers the following information for cell
   filtering: number of UMIs per cell, number of unique genes (i.e.
   features) per cell, percent of UMIs from the mitochondrial genes per
   cell, and the prediction of doublets and multiplets. Additionally,
   though not commonly implemented in other software, scDAPP evaluates the
   percent of reads from hemoglobin-related genes, as red blood cell lysis
   buffer may miss its target population; these cells have the very
   distinct feature of extremely high hemoglobin gene expression. For each
   of these QC metrics, users may select initial relaxed thresholds, such
   as minimum UMIs (500 by default), minimum number of unique genes (200),
   maximum percent of mitochondrial reads (25%), and maximum percent of
   hemoglobin read (25%). On top of these user's settings, scDAPP will
   further optimize the thresholds by analyzing the distribution of the
   underlying data. The distributions and the corresponding data-driven
   cutoffs are presented to the users graphically so that they may select
   thresholds deemed more appropriate. Essentially, the algorithms for
   deriving these data-driven cutoffs try to emulate the common manual QC,
   such as visual inspection of violin plots of these variables. First, a
   ‘complexity’ filter is applied, removing cells with a
   lower-than-expected number of unique genes given the number of observed
   UMIs. This is determined by two regression models, linear regression
   and LOESS regression with the log(nGenes) as the dependent variable and
   log(nUMI) as the predictor. Cells are considered low-complexity
   outliers if their linear regression Cook's distance is greater than 4 /
   number of cells, and their LOESS regression scaled residual value is
   less than −5, by default. The residual threshold is passed as a user
   parameter with higher values increasing the strictness of the filter.
   Next, scDAPP will learn data-driven cutoffs for the number of UMIs and
   the percent of mitochondrial reads (%mt). For these, robust statistics
   methods are applied, where cells with median absolute deviations
   above +2.5 (by default, tunable) for %mt and below −2.5 (default) for
   number of UMIs are considered low-quality outliers and removed. Again,
   users have the option to ignore and overwrite these data-driven cutoffs
   entirely.

   After poor quality cells are removed, scDAPP optionally uses
   DoubletFinder to predict doublets for removal from further analysis
   ([54]19). DoubletFinder hyperparameters such as the homotypic doublet
   rate are automatically estimated for each sample using the number of
   cells and the empirical multiplet rate provided by 10X Genomics
   ([55]20).

Individual sample clustering analysis

   After cell filtering, individual sample analysis is performed with
   Seurat using the modified SingleCellTransform (SCT) workflow
   ([56]14,[57]21). For each sample, scDAPP applies SCT, principal
   component analysis (PCA), graph construction, Louvain clustering, and
   Uniform Manifold Approximation and Projection (UMAP) for visualization.
   The hyperparameters including number of PCs (30 by default) and Louvain
   resolution (default 0.5) can be specified by the user, and iterated if
   needed, after examining the result from default settings.

Automated reference-based cell annotation via label transfer

   While cluster markers are always computed, cluster annotation by label
   transfer using Seurat is an option, as described above. Two extensions
   are made in scDAPP to help users. (i) We apply a hard cutoff of label
   transfer score of 0.3, below which cells are considered non-classified,
   and a soft threshold of 0.5, below which cells are considered only
   putatively classified. (ii) Seurat label transfer gives a score and
   label for each cell, but we extend this to the cluster level by setting
   the transferred annotation to the most common predicted cell type for
   each cluster. To show the quality of label transfer, scDAPP generates
   cluster-level violin plots for the label transfer scores and provides
   heatmap visualization of the expression of the reference marker genes.
   Automated cell calling tools like label transfer are useful for
   providing a suggested cell annotation, but the results should be
   carefully examined and thus scDAPP makes all the relevant plots and
   scores available to the users.

Sample integration and batch correction

   Next, scDAPP uses the RISC workflow for integration and batch
   correction, starting from the raw data matrices before Seurat analysis.
   RISC is also used for Louvain clustering and UMAP visualization of the
   integration data. There are scDAPP-specific extensions of the RISC
   package. By default, RISC uses the intersect of genes from the single
   cell objects as integration features, but this can leave out genes
   detected in only one sample. This can be problematic if one cell type
   (or corresponding marker genes) is absent in one sample or one of the
   comparison groups, potentially leading to missing marker genes for that
   cell type after integration. To overcome this, scDAPP reads the raw
   count data directly to RISC, selects cells filtered in the individual
   sample analysis, concatenates the cell-filtered matrices, and then
   performs gene filtering, such that more genes are retained. Several
   other key parameters are allowed for RISC integration analysis,
   including the number of PCs, the number of neighbors during graph
   construction, and clustering resolution. The default for these in
   scDAPP are reasonable: PCs = 30, resolution = 0.5, neighbors = 10, but
   they may be modified by users.

   Another important extension of the RISC package in scDAPP is automated
   selection of a reference sample for multiple-sample integration.
   Currently, this is done manually. RISC users examine a panel of plots
   generated by the ‘InPlot’ function of RISC, which describe the number
   of clusters, variance per PC, and a measure of distributional
   divergence for each sample ([58]8), and then select a sample for
   integration reference. We decided to automate this process in scDAPP by
   calculating a heuristic reference score for each sample. This score is
   based on the number of clusters, the cluster diversity, and the number
   of cells in each sample, and generally the sample with the greatest
   number of clusters, greatest diversity of clusters and highest number
   of cells gets the highest score and is chosen as the best reference.

   For each sample Inline graphic , let Inline graphic denote the number
   of clusters, and Inline graphic denote a weighted cell number value,
   defined as:
   graphic file with name M0003.gif

   Next, let Inline graphic denote the variance of cluster Inline graphic
   in sample Inline graphic . We compute a cluster-average variance score
   Inline graphic for sample Inline graphic by calculating the mean value
   of the variances of each cluster:
   graphic file with name M0009.gif

   Then, a reference score Inline graphic for sample Inline graphic is
   computed as:
   graphic file with name M00012.gif

   Finally, the sample Inline graphic with the maximum S score is chosen
   as the reference: Inline graphic .

   This is meant to automate the RISC recommendation that the sample with
   the most cell types is the preferred integration reference. As with the
   setting of all other parameters in scDAPP, the pipeline shows all the
   RISC InPlot graphs so that users can examine the underlying data and
   manually specify the reference sample via the ‘risc_reference’
   parameter in the run configuration file. Very importantly, scDAPP
   generates alluvial plots to illustrate cluster relationships between
   individual samples and integrated data, thus helping users to spot
   over- or under-integration.

Comparison of cell composition across groups

   After RISC integration, scDAPP performs cross-group comparison. First,
   differential cell composition analysis is applied to study cell
   population changes between groups by comparing the proportions of cells
   in each cluster across samples. As mentioned above, this can be done in
   a replicate-aware manner using the ‘Propeller’ test from the ‘Speckle’
   package, which is a t-test of the proportions for each cluster with
   samples in each group as replicates. Notably, scDAPP applies the
   square-root arcsine transformation that is provided as an option by
   Propeller, as this transformation was shown to perform best in a
   benchmarking analysis ([59]3). If replicates are not available or not
   considered, scDAPP will utilize the two proportion Z-test as
   implemented in the R prop.test function.

Differential expression analysis across groups

   As stated above, this can be done via a replicate-aware pseudobulk
   based method by setting ‘Pseudobulk_mode’ = ‘TRUE’. With this option,
   by default, the EdgeR likelihood ratio test (EdgeR-LRT) method is used,
   as this slightly outperformed default EdgeR and other pseudobulk
   methods like DESeq2 in a recent benchmark ([60]2). Optionally, users
   may use the ‘DE_test’ parameter to select either the EdgeR, EdgeR-LRT,
   DESeq2 or DESeq2-LRT test. The data used for these tests are raw counts
   (i.e. UMIs) aggregated over cells in each cluster per replicate for
   individual genes. The statistical outputs are combined with other
   important single-cell level information including the percent of cells
   expressing a gene in each condition, allowing downstream prioritization
   or further filtering of the differentially expressed genes (DEGs). If
   replicates are not available or not considered, differential expression
   can be run with ‘Pseudobulk_mode’ = ‘FALSE’ and performed using the
   Wilcoxon Rank-Sum test as implemented in the Seurat ‘FindMarkers’
   function by default, but importantly using the RISC batch-corrected
   gene expression values. Here too, users may optionally change to other
   Seurat statistical test (e.g. ‘MAST’) by passing arguments to the
   ‘DE_test’ parameter.

Pathway enrichment analysis

   Finally, scDAPP performs multiple types of function enrichment analysis
   using the differential gene expression results. This includes GSEA as
   implemented in the ‘fgsea’ R package ([61]22) and overrepresentation
   analysis (ORA). Notably, GSEA allows pre-ranking of all genes based on
   differential expression summary statistics. This obviates the need for
   arbitrary differential expression thresholds and works well with
   pseudobulk-based methods. Currently, scDAPP uses the gene sets from the
   MSIGDB pathway database, including the Hallmarks pathways, Gene
   Ontology (GO), KEGG, Reactome, and two transcription factor target
   databases, the Gene Transcription Regulation Database (GTRD) and Xie
   et al. Nature 2005 database ([62]23–27). We draw on these databases
   using the ‘msigdbr’ R package, which also allows flexibility across a
   range of select species with careful multi-source homology support for
   gene orthologs ([63]18). For the ORA option that is invoked by the
   ‘run_ORA’, scDAPP tests for significant overlaps between the DEGs in
   each cluster and pathway databases via the hypergeometric test as
   implemented in the ClusterProfiler package ([64]28).

Outputs of the scDAPP pipeline

   Once completed, scDAPP will save a variety of critical outputs from
   each stage of the pipeline, including a detailed HTML file (derived
   from R Markdown; containing extensive visualizations and tables for QC,
   intermediate results, and final results), text-format results,
   intermediate data files, and R objects (Figure [65]1D). The HTML report
   file ([66]Supplementary File 1 for an example) summarizes all steps of
   the pipeline and all processing and results, including quality control
   information, data behind threshold selections, UMAPs for clustering
   results, plots for marker genes, and cross-group analysis. Relevant
   codes are embedded and can be viewed in this report file. We use UMAP
   for visualization but do not recommend it for direct inference of the
   degree of similarity between clusters.

   Key result data tables are saved as .csv files, including results for
   marker genes, differential expression, differential abundance, and
   pathway analysis. High-resolution plots are also stored as .pdf files
   for each step. All these are included to make it transparent for users
   to track and understand the nuance and decisions in each step of the
   scRNA-seq analysis. The data is also valuable for users to adjust the
   parameters to re-run scDAPP until they are satisfied. In this sense,
   scDAPP is an excellent education tool for learning scRNA-seq analysis.
   Additionally, critical data objects are exported and saved, including
   clustered Seurat objects for each individual sample, a RISC object for
   the integrated data, and a Seurat object converted from the integrated
   data in the RISC format, and pseudobulking data. Users with
   advanced bioinformatics expertise can use them to seamlessly conduct
   further analyses with their own established workflows. Related to this,
   scDAPP has some utility R scripts for downstream analysis, such as a
   wrapper function for the recently described aPEAR algorithm for
   clustering enriched functions from GSEA ([67]29). Another downstream
   application includes preparation of interactive web applications. For
   this purpose, we provide a short vignette linking the output of scDAPP
   with the ShinyCell package, which is a user-friendly tool to easily
   export scRNA-seq objects to a web-based Shiny app ([68]30).

Results

   We have applied scDAPP successfully to many scRNA-seq and snRNA-seq
   data in our own studies. To demonstrate its performance, we included
   two instances below.

Reanalysis of COVID-19 Blood Atlas dataset with scDAPP revealed high
concordance with published findings

   We applied scDAPP to an atlas dataset of human blood cells from
   COVID-19 patients and healthy controls ([69]31). We selected three
   patient samples each from mild COVID, critical COVID, and controls and
   ran scDAPP in pseudobulk mode to account for replicates. The input
   contained nine samples (3 × 3) and a total of 83,356 cells. We ran
   scDAPP on a SLURM-based high performance cluster (HPC) with an
   allocation of 100GB memory and 9 CPUs, and completed in ∼14
   h (9 × 14 = 126 CPU hours). The main and full HTML report is in a
   supplemental file ([70]Supplementary File 1), with some key results
   included in Figure [71]2 to illustrate scDAPP functions, specifically
   UMAP of the clustering result, expression heatmap of the top cluster
   markers, cell population change, and altered pathways from GSEA. The
   run also used the label transfer option, taking an independent healthy
   control sample as the reference for cell annotation (provided by the
   original authors). Overall, our results are in strong concordance with
   the original publication, for example, increased abundance of
   plasmablast (PB) and B cells and upregulation of interferon responses
   in many cell types in critical COVID samples (Figure [72]2B).

Figure 2.

   [73]Figure 2.
   [74]Open in a new tab

   Reanalysis of blood scRNA-seq data from COVID-19 patients and controls
   with scDAPP. (A) Integrated UMAPs colored by clusters and separated by
   patient groups. (B) Table showing the label transfer result using a
   healthy non-COVID-19 sample from the same study (not included in the
   clustering and other re-analysis). MNP = ‘Mononuclear phagocytes’
   (monocytes / macrophages), PB = ‘plasmablast’, ‘nan’ indicates
   un-annotated cell types from the original published annotations. (C)
   Heatmap showing the expression of top markers computed for the clusters
   in (A). (D) Heatmap table showing the cell composition change between
   critical COVID-19 patient and healthy control samples. * P < 0.05. (E)
   Differential pathway analysis showing gene sets upregulated in critical
   COVID-19 patients versus healthy samples. All these figures were taken
   from the scDAPP output directly ([75]Supplementary File 1).

Reanalysis of mouse heart developmental scRNA-seq data with scDAPP
recapitulated published results

   We next applied scDAPP to a scRNA-seq dataset from mouse neural crest
   cells (NCCs), collected for studying the congenital heart defects in
   22q11.2 deletion syndrome ([76]32). The data were from heart tissues at
   embryonic day (E)10.5 of control embryos or embryos with Tbx1 knockout
   (Wnt1-Cre;Tbx1^-/-), a gene in the 22q11.2 region and required for
   cardiac development ([77]33–35). The input contained four E10.5 samples
   (2 Tbx1 wild-type and 2 Tbx1 knockout) and a total of 39,401 cells
   (Figure [78]3). We ran scDAPP with standard parameters except for the
   integrated clustering resolution (‘res_int’; set to 1) in order to
   obtain clusters closely matching the previous publication ([79]32), on
   a SLURM-based HPC with 50 GB memory allocation and 4 CPUs for ∼12 h.
   The RPCI reference sample automatically chosen by scDAPP was
   ‘Control2’, the same one selected manually in the previous paper
   (Figure [80]3). Taken the authors’ cell type annotation and markers
   from an E9.5 sample in the same study for label transfer, scDAPP was
   able to accurately assign cell types (Figure [81]3B, C). A close
   examination of the cluster relationship revealed differences, but most
   cells were clustered similarly by types in scDAPP and previous report
   (Figure [82]3D), suggesting that computational choice and software
   version could make subtle difference. The cell composition change
   between the controls and Tbx1 null embryos was also reproduced
   (difference in the craniofacial and outflow tract NCCs, corresponding
   to the scDAPP clusters 2 and 6) (Figure [83]3E). Finally, we compared
   the DEGs in the cardiac progenitor NCCs and found a large agreement
   between scDAPP and previous results, e.g. upregulation of Msx2, Bambi,
   Gata3, and Tbx2 in the Tbx1^-/- embryos (Figure [84]3F), all of which
   are critical downstream targets of Tbx1 ([85]32).

Figure 3.

   [86]Figure 3.
   [87]Open in a new tab

   Reanalysis of E10.5 Neural Crest Cell scRNA-seq data with scDAPP. (A)
   Table showing the scDAPP-computed RISC reference scores for the four
   samples, indicating that the sample ‘Control2’ had the highest score
   and was selected as the reference for RISC integration, consistent with
   what was selected manually in the original analysis. (B) Table showing
   the cluster annotation from label transfer using an independent E9.5
   NCC dataset, yielding the same cluster annotation as in the original
   publication. (C) UMAP plot of the integrated samples colored by
   clusters. (D) Table showing the relationship between clusters in
   published versus scDAPP re-analysis. Note that clusters 6 and 17 were
   not included in the original study, indicating different thresholds for
   cell filtering. (E). Table for cell compositional analysis showing
   significant decreases in cell number with Tbx1 KO in the clusters 2 and
   6, consistent with the original report. * P < 0.05. (F) Violin plots
   showing the differential expression of four key genes highlighted in
   the paper for cardiac progenitor neural crest cells (cluster 3 in
   scDAPP). Panels A, B, C and E were taken directly from the scDAPP
   output.

Discussion

   Whether scDAPP is used for a quick end-to-end bioinformatics analysis
   of the data or recurrent processing of the same data with different
   options, a main goal of its development is to help users visualize the
   granularity of sc/nRNA-seq analysis and achieve a quick transition from
   raw data to biological interpretation, while learning and exploring
   analytic parameters and options along the way. Based on our experience,
   we believe that the visualizations in the HTML output are extremely
   valuable for users, to examine data distribution, cluster marker
   specificity, cluster relationship, cell population shifts, biological
   relevant pathways and so on ([88]Supplementary File 1).

   Importantly, scDAPP emphasizes on full utilization of replicates for
   both differential gene expression analysis and differential cell
   composition analysis. The usage of replicates has repeatedly been
   demonstrated as a critical factor for specificity in single-cell
   transcriptomic comparative analysis. To our knowledge, scDAPP is the
   first such end-to-end pipeline to explicitly feature this
   replicate-aware approach, but we should note that there are other
   computational methods that use replicates but not pseudobulking.

   As scDAPP is a pipeline, we have not systematically benchmarked all the
   software by ourselves and rather have taken the recommendations from
   the community. The modular design, however, provides sufficient
   flexibility for including additional software in the future. For
   example, modified Dirichlet models were also shown to perform very well
   for differential cluster abundance analysis ([89]3), and thus may be
   included to complement Propeller. Additionally, new methods based on
   combinatorial indexing are now capable of producing datasets with
   hundreds of thousands to millions of cells per sample ([90]36). Such
   technological advancements represent a paradigm shift for the
   single-cell field in general and may require implementation of
   specialized tools relying on highly optimized memory storage or
   combining cells together into metacells, important functions to
   consider for future scDAPP releases.

Supplementary Material

   lqae134_Supplemental_Files
   [91]lqae134_supplemental_files.zip^ (71.1MB, zip)

Acknowledgements