Graphical abstract

   graphic file with name fx1.jpg
   [26]Open in a new tab

Highlights

     * •
       scRNA-seq analysis of lymphocytes of healthy and Mycobacterium
       tuberculosis-infected mice lung
     * •
       Pipeline will integrate samples across different conditions
     * •
       Allows identification of cell types based on both known and unique
       markers
     * •
       Explore transcriptional activities across diverse conditions
     __________________________________________________________________

   Publisher’s note: Undertaking any experimental protocol requires
   adherence to local institutional guidelines for laboratory safety and
   ethics.
     __________________________________________________________________

   Processing and analyzing single-cell RNA-seq (scRNA-seq) from lung
   cells are challenging due to the complexity of cell subtypes and
   biological variations within sample groups. Here, we present a protocol
   for performing an in-depth assessment on lung lymphocyte populations
   derived from healthy and Mycobacterium tuberculosis-infected mice. We
   describe steps for downloading processed scRNA-seq data, integrating
   samples across different conditions, and performing cluster analysis.
   We then detail procedures for identifying lymphoid cell subtypes,
   differential analysis, and pathway enrichment analysis.

Before you begin

     Inline graphic Timing: 30 min

   This manuscript details the steps necessary to identify lymphoid cell
   populations at the single cell level in the mouse lung. Comprehensive
   steps for single-cell RNA-seq data cluster analysis and cell type
   identification are discussed here. In different clusters, the process
   to perform differential analysis and enriched pathways while comparing
   among different groups is also described. In silico validations are
   also performed in Akter et al.[27]^1
     * 1.
       Download the dataset.

     Note: The study includes uninfected (n = 2), Mtb-infected at 50 days
     post-infection (d50) (n = 3) and 100 days post infection (d100) (n =
     3) scRNA-seq samples from mice lung.
          + a.
            Download the dataset listed in the “Deposited data” section in
            the [28]key resources table.
          + b.
            Download the cellranger processed matrices where we have three
            files (barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz)
            for each sample.
          + c.
            Save these in folders named by each sample under a directory
            name “cellranger_out”.
     * 2.
       Install R on your machine.

     Note: Most of this protocol utilizes R language (R software v.3.5.3)
     for data processing and analysis.

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Deposited data
     __________________________________________________________________

   scRNA-seq, raw, and analyzed data Akter et al.[29]^1 GEO: [30]GSE200639
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   Cellranger 3.1 10x Genomics [31]https://support.10xgenomics.com/
   Seurat 3 Stuart et al.[32]^2 [33]https://satijalab.org/seurat
   ggplot2 Wickham[34]^3 [35]https://cloud.r-project.org/package=ggplot2
   biomaRt Durinck et al.[36]^4
   [37]https://bioconductor.org/packages/biomaRt
   PANTHER Mi and Thomas[38]^5 [39]http://www.pantherdb.org/
   [40]Open in a new tab

Materials and equipment

Bioinformatics analysis

   All bioinformatics analyses have been carried out on the University of
   Chicago’s high-performance scientific compute platform (step 1-6 of the
   analysis) or on a local computer (step 7-8). We allocate a single CPU
   and 100 GB RAM in the institutional high-performance scientific
   computing environment. The local PC is a MacBook Pro with 2.3 GHz
   Quad-Core Intel Core i7 processor, 16 GB RAM).

Step-by-step method details

Processing the single-cell RNA-seq data

     Inline graphic Timing: ∼ 4 h

   These steps help researchers to preprocess the single-cell data and get
   it ready for downstream analysis once it comes out of the machine.
     * 1.
       Quality Control.

     Note: This section describes the process of quality control and
     decides on the filtering criteria of the scRNA-seq dataset including
     the installation of required R packages.
          + a.
            Installation and set up: Install all the R packages and set up
            the directories first and load the dataset in R.

     Note: We use several R packages to generate figures and intermediate
     data preprocessing, such as Seurat[41]^2 package (v.3.0.0),
     ggplot2,[42]^3 and biomaRt.[43]^4
          + b.
            Explore the data: To decide on the filtering criteria, explore
            the data and plot the total genes per cell and total
            mitochondrial genes per cell on a log scale.
               o i.
                 First filter and count a gene only if it has non-zero
                 reads mapped. Then plot genes per cell for each sample
                 separately.
               o ii.
                 Second, count the percentage of mitochondrial genes and
                 plot mitochondrial genes per cell for each sample
                 separately.
                 These steps help to decide on the filtering threshold.
                 # 1a. Installation and setting up the directories
                 install.packages('Seurat')
                 library(Seurat)
                 library(ggplot2)
                 #please replace the “/PATH/TO/YOUR/DIRECTORY” with your
                 project directory
                 setwd(/PATH/TO/YOUR/DIRECTORY)
                 plots<- "./output/"
                 cellranger_out<-"./cellranger_output/"
                 # 1b: Explore the data #Change sample names as needed
                 samples<- c("TB1000A1","TB1000A2","TB1000A3","TB1000A4",
                    "TB1000A5","TB1000A6","TB1000A7","TB1000A8")
                 for (i in 1:length(samples)){
                  data <- Read10X(data.dir = paste0(cellranger_out,
                 samples[i]))
                 genes_per_cell <- Matrix::colSums(data>0)
                 pdf(file=paste0(plots,samples[i],"_genes_per_cell_ordered
                 .pdf"),paper="USr",width=11)
                  par(oma=c(0,0,2,0))
                  par(mfrow=c(1,1))
                  plot(sort(genes_per_cell), xlab='cell', log='y',
                 main='genes per cell (ordered)')
                  title(main=paste0(samples[i]),outer=T)
                  dev.off()
                  mito.genes<-grep(pattern="ˆmt-",x = rownames(x=data),
                 value=TRUE)
                  mito_gene_read_counts =
                 Matrix::colSums(data[mito.genes,])
                 # compute percentage of mitochondrial genes per cell
                  pct_mito = mito_gene_read_counts / counts_per_cell ∗ 100
                 pdf(file=paste0(plots,samples[i],"_mito_gene_per_cell.pdf
                 "),paper="USr",width=11)
                  par(oma=c(0,0,2,0))
                  par(mfrow=c(1,1))
                  plot(sort(pct_mito))
                  title(main=paste0(samples[i]),outer=T)
                  dev.off()
                 }
     * 2.
       Create Seurat objects.

   Read the scRNA-seq dataset and create Seurat objects. Add the
   percentage of mitochondrial genes in the Seurat metadata and the
   conditions of the samples: uninfected, infected at d50, and infected at
   d100.

   ## 2: create seurat objects

   # Set up condition 1: uninfected object

   for (file in c("TB1000A1","TB1000A2")){

    data <- Read10X(data.dir = paste0(cellranger_out, file))

    data_obj <- CreateSeuratObject(counts = data, min.features = 200,
   min.cells = 3, project = file)

   mito.genes <- grep(pattern = "ˆmt-", x = rownames(x = data_obj),
   value = TRUE)

   percent.mito <- Matrix::colSums(x = GetAssayData(object = data_obj,
   slot = 'counts')[mito.genes, ]) / Matrix::colSums(x =
   GetAssayData(object = data_obj, slot = 'counts'))

    data_obj[['percent.mito']] <- percent.mito

    data_obj[["Dataset"]]<-file

    data_obj[["condition"]]<-"uninfected"

    assign(file, data_obj)

   }

   # Set up condition 2: infected d50 object

   for (file in c("TB1000A3","TB1000A4","TB1000A5")){

    data <- Read10X(data.dir = paste0(cellranger_out, file))

    data_obj <- CreateSeuratObject(counts = data, min.features = 200,
   min.cells = 3, project = file)

    mito.genes <- grep(pattern = "ˆmt-", x = rownames(x = data_obj),
   value = TRUE)

    percent.mito <- Matrix::colSums(x = GetAssayData(object = data_obj,
   slot = 'counts')[mito.genes, ]) / Matrix::colSums(x =
   GetAssayData(object = data_obj, slot = 'counts'))

    data_obj[['percent.mito']] <- percent.mito

    data_obj[["Dataset"]]<-file

    data_obj[["condition"]]<-"infected_d50"

    assign(file, data_obj)

   }

   #Set up condition 3: infected d100 object

   for (file in c("TB1000A6","TB1000A7","TB1000A8")){

    data <- Read10X(data.dir = paste0(cellranger_out, file))

    data_obj <- CreateSeuratObject(counts = data, min.features = 200,
   min.cells = 3, project = file)

    mito.genes <- grep(pattern = "ˆmt-", x = rownames(x = data_obj),
   value = TRUE)

    percent.mito <- Matrix::colSums(x = GetAssayData(object = data_obj,
   slot = 'counts')[mito.genes, ]) / Matrix::colSums(x =
   GetAssayData(object = data_obj, slot = 'counts'))

    data_obj[['percent.mito']] <- percent.mito

    data_obj[["Dataset"]]<-file

    data_obj[["condition"]]<-"infected_d100"

    assign(file, data_obj)

   }

   rm(data)

   rm(data_obj)

   gc()
     * 3.
       Filter followed by normalization.
          + a.
            Filter cells that have unique feature counts over certain
            threshold. Thresholds are decided based on the plots of step
            1bi (genes_per_cell_ordered.pdf) where “sorted genes per cell”
            is more consistent within the thresholds and sharply falls
            otherwise.

     Note: We filter cells that have unique feature counts over 4,500 or
     less than 1,100.
          + b.
            Set threshold for the percentage of mitochondrial genes based
            on the mitochondrial gene plots of step 1bii for each of the
            sample (mito_gene_per_cell.pdf). Beyond this threshold, there
            is a sharp increase in the plot.

     Note: We choose different thresholds for mitochondrial genes because
     of their distinct distribution in each sample.
          + c.
            Merge samples with similar conditions.
          + d.
            Normalize data with default parameters, and detect most
            variable genes using the FindVariableFeatures function.

     Note: All the samples of this study are sequenced in one batch. So,
     we merge samples with similar conditions as these are biological
     replicates. If samples are obtained from different batches, user
     should check and normalize for batch effects.
            3. Filtering and normalization
            TB1000A1 <- subset(TB1000A1, subset = nFeature_RNA >1100 &
            nFeature_RNA < 4500 & percent.mito < 8)
            TB1000A2 <- subset(TB1000A2, subset = nFeature_RNA >1100 &
            nFeature_RNA < 4500 & percent.mito < 5)
            # TB1000A1 and TB1000A2 are uninfected
            TB1000A12 <- merge(x = TB1000A1, y = c(TB1000A2),
            add.cell.id = ("TB1000A1","TB1000A2"))
            TB1000A12 <- NormalizeData(TB1000A12, verbose = FALSE)
            TB1000A12 <- FindVariableFeatures(TB1000A12,
            selection.method = "vst", nfeatures = 2000)
            TB1000A3 <- subset(TB1000A3, subset = nFeature_RNA >1100 &
            nFeature_RNA < 4500 & percent.mito < 8)
            TB1000A4 <- subset(TB1000A4, subset = nFeature_RNA >1100 &
            nFeature_RNA < 4500 & percent.mito < 9)
            TB1000A5 <- subset(TB1000A5, subset = nFeature_RNA >1100 &
            nFeature_RNA < 4500 & percent.mito < 6)
            # TB1000A3, TB1000A4 and TB1000A5 are infected at 50dpi
            TB1000A345 <- merge(x = TB1000A3, y = c(TB1000A4,TB1000A5),
            add.cell.id = c("TB1000A3","TB1000A4","TB1000A5"))
            TB1000A345 <- NormalizeData(TB1000A345, verbose = FALSE)
            TB1000A345 <- FindVariableFeatures(TB1000A345,
            selection.method = "vst", nfeatures = 2000)
            TB1000A6 <- subset(TB1000A6, subset = nFeature_RNA >1100 &
            nFeature_RNA < 4500 & percent.mito < 6)
            TB1000A7 <- subset(TB1000A7, subset = nFeature_RNA >1100 &
            nFeature_RNA < 4500 & percent.mito < 7)
            TB1000A8 <- subset(TB1000A8, subset = nFeature_RNA >1100 &
            nFeature_RNA < 4500 & percent.mito < 8)
            # TB1000A6, TB1000A7 and TB1000A8 are infected at 100dpi
            TB1000A678 <- merge(x = TB1000A6, y = c(TB1000A7,TB1000A8),
            add.cell.id =c("TB1000A6","TB1000A7","TB1000A8"))
            TB1000A678 <- NormalizeData(TB1000A678, verbose = FALSE)
            TB1000A678 <- FindVariableFeatures(TB1000A678,
            selection.method = "vst", nfeatures = 2000)
     * 4.
       Perform Integration and integrated analysis.
          + a.
            Combine all samples using Seurat functions,
            FindIntegrationAnchors and IntegrateData.

     Note: This step integrates samples from different conditions.
          + b.
            Use ScaleData to regress out a number of UMI’s and
            mitochondrial content.
          + c.
            Perform Principal component analysis (PCA) with RunPCA.
          + d.
            Use ElbowPlot to visualize the standard deviation of each PCA.

     Note:ElbowPlot helps to determine how many PCAs are needed to
     capture most of the variation in the data. In this study, we found
     17 PCAs can capture majority of the variations.
          + e.
            Perform t-SNE dimensionality reduction on the scaled matrix.

     Note: We use the first 17 PCA for dimensionality reduction.
          + f.
            The resolution parameter of the FindClusters function controls
            the granularity of the clustering, with higher values leading
            to more clusters.

     Note: There is no universally 'correct' resolution. However, user
     can try different values within this range and see which one best
     reflects data's biological variability.
          + g.
            Next, visualize the clusters ([44]Figure 1) and save the
            integrated Seurat object (“immune.combined”).
            # 4. Perform Integration and integrated analysis
            immune.anchors <- FindIntegrationAnchors(object.list =
            list(TB1000A12, TB1000A345, TB1000A678), dims = 1:20)
            immune.combined <- IntegrateData(anchorset = immune.anchors,
            dims = 1:20)
            # Run the standard workflow for visualization and clustering
            all.genes <- rownames(immune.combined)
            immune.combined <- ScaleData(immune.combined, features =
            all.genes)
            #Perform linear dimensional reduction
            immune.combined <- RunPCA(immune.combined, npcs =
            30,features = VariableFeatures(object = immune.combined))
            DimPlot(immune.combined, reduction = "pca") + NoLegend()
            #ElbowPlot which visualizes the standard deviation
            ElbowPlot(immune.combined)
            # Clustering
            immune.combined <- FindNeighbors(immune.combined, reduction =
            "pca", dims = 1:17)
            immune.combined <- FindClusters(immune.combined, resolution =
            c(0.4))
            # tSNE and UMAP
            immune.combined <- RunUMAP(immune.combined, reduction = "pca",
            dims = 1:17)
            immune.combined <- RunTSNE(immune.combined, reduction = "pca",
            dims = 1:17)
            #[45]Figure 1
            TSNEPlot(object = immune.combined, label=F)
            saveRDS(immune.combined, file = "./immune.combined.rds")

Figure 1.

   [46]Figure 1
   [47]Open in a new tab

   t-Distributed stochastic neighbor embedding (tSNE) visualization of
   different subsets of cells (all conditions together), colored according
   to different clusters

Cluster analysis and cell type identification

     Inline graphic Timing: ∼ 4 h

   These steps help researchers to perform cluster analysis and to know
   the process of identification of cell types.
     * 5.
       Identify major cell types.
       Identify the non-immune and major immune cell types using the key
       genes and top marker genes.

     Note: Since in this study, most of the cells are lymphoid cells due
     to the prior purification for live cells,[48]^1 we focus on
     characterizing the lymphoid cell subsets only.
       The composition of cell types varies across different tissues. For
       instance, brain tissue is characterized by the presence of neurons,
       astrocytes, oligodendrocytes, and microglia, whereas blood
       typically contains lymphocytes, neutrophils, monocytes, among
       others. Such cellular diversity or heterogeneity has significant
       implications for the interpretation of scRNA-seq data. Hence, it is
       crucial to account for these differences during clustering and
       differential expression analyses.
       Furthermore, tissue samples may be subject to batch effects. These
       effects could occur if the samples undergo processing at different
       time points or if they are handled by distinct individuals. Tools
       like Seurat and Harmony offer features for batch correction,
       including Integration and Harmony methods, that are invaluable in
       managing these potential discrepancies.
          + a.
            Identify the marker genes in each cluster and focused on the
            top 10 markers based on p-values.

     Note: These are the key genes that help to understand the type of
     cells within the cluster. If the user uses a specific database to
     determine the marker genes of any particular cell type, the user
     should also consider determining if gene signatures in related
     tissue type as well, since gene expressions can vary based on the
     type of tissue.
          + b.
            Look into the expression of known marker genes.

     Note: Immune cells can be identified as clusters expressing Ptprc,
     T cells expressing CD3d, Cd4, CD8b1, Natural killer (NK) cells
     expressing Gzmb, Ncr1, B cells expressing CD79a, etc ([49]Figure 2).
     Known maker genes helps to identify major cell types, and with that
     top markers of each cluster can enable further clustering of cell
     sub populations.
          + c.
            Name the clusters by their corresponding major cell types to
            visualize.[50]^1

     Note: Identify the major cell types only as we will do re-clustering
     the lymphoid cells in the next step since majority of the cells in
     this dataset are lymphoid cells.
            # 5a: Identify top 10 markers of each cluster
            all.markers = FindAllMarkers(immune.combined, min.pct = 0.25,
            logfc.threshold = 0.25, only.pos = TRUE)
            # sort all the markers by p-value
            all.markers.sortedByPval =
            all.markers[order(all.markers$p_val),]
            top10 <- all.markers.sortedByPval %>% group_by(cluster) %>%
            do(head(., n=10))
            write.csv(top10,"top10_findallMarkers_onlyPos.csv")
            # 5b: Look into the expression of known markers
            DefaultAssay(immune.combined) <- "RNA"
            #[51]Figure 2
            FeaturePlot(object = immune.combined, features = c( "Ptprc",
            "Cd3d",
            "Cd4", "Cd8b1", "Gzmb","Ncr1","Cd79a", "Mki67", "Cd7"), cols =
            c("grey", "blue"), reduction = "tsne")
            # 5c: tSNE visualization colored by major cell-types
            new.cluster.ids <-
            c("lymphoid","lymphoid","lymphoid","lymphoid","lymphoid","lymp
            hoid","lymphoid","lymphoid","lymphoid","lymphoid","lymphoid",
                 "lymphoid","myeloid","myeloid","non-immune","lymphoid","l
            ymphoid","lymphoid","myeloid","myeloid","myeloid")
            names(new.cluster.ids) <- levels(immune.combined)
            immune.combined <- RenameIdents(immune.combined,
            new.cluster.ids)
            TSNEPlot(object = immune.combined)

     * 6.
       Re-cluster the lymphoid cells only and identify the cell subtypes.
       To identify the cell subtypes of each cluster, subset the lymphoid
       cells and perform cluster analysis.
          + a.
            Re-clustering with lymphoid cells only.

     Note: The process is similar to the previous cluster analysis with
     all cells. To obtain a better resolution of cell subtypes,
     re-cluster analysis can be performed.
          + b.
            Find top markers of each lymphoid cluster.

     Note: Top markers help to identify the cell subtypes of each
     cluster.
          + c.
            Expression of top marker genes.

     Note: Investigate the expression of interested top marker genes.
          + d.
            Expression of known marker genes.

     Note: Investigate any known lymphoid marker genes along with top
     marker genes to know the cell type of a cluster. Using only one
     method is not enough to get a cell identity.[52]^1
          + e.
            Plot the clusters. Plot can be split for each condition
            ([53]Figure 3).

     Note: Clusters can also be renamed based on their respective cell
     identity and then visualized.
            # 6a: Re-clustering with lymphoid cells only
            DefaultAssay(immune.combined) <- "integrated"
            lymphoid_cluster <- subset(immune.combined, idents =
            c(0,1,2,3,4,5,6,7,8,9,10,11,15,16,17))
            all.genes <- rownames(lymphoid_cluster)
            lymphoid_cluster <- ScaleData(lymphoid_cluster, features =
            all.genes)
            lymphoid_cluster <- RunPCA(lymphoid_cluster, npcs =
            30,features = VariableFeatures(object = lymphoid_cluster))
            lymphoid_cluster <- RunTSNE(lymphoid_cluster, reduction =
            "pca", dims = 1:20)
            lymphoid_cluster <- FindNeighbors(lymphoid_cluster,
            reduction = "pca", dims = 1:20)
            lymphoid_cluster <- FindClusters(lymphoid_cluster,
            resolution = c(0.4))
            saveRDS(lymphoid_cluster, file = "./lymphoid.combined.rds")
            # 6b: Find markers of each lymphoid cluster
            all.markers = FindAllMarkers(lymphoid_cluster, min.pct = 0.25,
            logfc.threshold = 0.25, only.pos = TRUE)
            # sort all the markers by p-value
            all.markers.sortedByPval =
            all.markers[order(all.markers$p_val),]
            top10 <- all.markers.sortedByPval %>% group_by(cluster) %>%
            do(head(., n=10))
            write.csv(top10,"top10_findallMarkers_lymphoid_onlyPos.csv")
            # 6c: Expression of top genes in different clusters
            markers.to.plot <-
            c("Ebf1","Ifi30","Ifit3","Ifi27l2a","Vpreb3", "Spib",
            Sp140","Rtp4","Fcmr","Ms4a1","Mzb1","Il7r","Tcf7","Lef1","Ctla
            4","Cxcr3",
            "Cxcr6","Icos","Maf","Tnfrsf4","Cd7","Ctla2a","Il2rb","Ccl5","
            Cx3cr1","Zeb2","Lag3","Cd40lg","Stat1","Igtp","Ifng","Rora","S
            100a4","Gbp4","Gbp2","Fcer1g","Gzmb",
            "Gzma","Gzmk","Tnfrsf17", "Foxp3")
            DotPlot(lymphoid_cluster, features = rev(markers.to.plot),
            cols = c("blue", "red"), dot.scale = 8) + RotatedAxis()
            # 6d: Expression of known marker genes
            DefaultAssay(lymphoid_cluster) <- "RNA"
            FeaturePlot(object = lymphoid_cluster, features =
            c("Ptprc","Cd79a","Ms4a1","Cd3d","Cd3e","Cd4","Cd8b1","Cd8a","
            Gzmb","Ncr1", "Sell","Ccr7","Cd44","Ifng","Rora","Foxp3"),
            cols = c("grey", "blue"), reduction = "tsne")
            #[54]Figure 3
            DimPlot(lymphoid_cluster, reduction = "tsne",label = TRUE,
            split.by = "condition")
            # 6e: Rename the clusters with respective subtypes
            new.cluster.ids <- c("B naive","CD8+T naive","NK_1","CD4+T
            naive","CD4+T
            act_1","CD8+T eff","CD4+T IFN+","CD8+T act_1","CD8+T act_2",
                  "CD4+T act_2","B act_1","B mem","B/T
            doublets","NK_2","NK_3","B act_2","Plasma_1","Plasma_2")
            lymphoid_cluster <- RenameIdents(lymphoid_cluster,
            new.cluster.ids)
            #To visualize the clusters by labeling the cell sub-types
            #all together
            DimPlot(lymphoid_cluster, reduction = "tsne",label = TRUE)
            #split by conditions
            DimPlot(lymphoid_cluster, reduction = "tsne",label = TRUE,
            split.by = "condition")

Figure 2.

   [55]Figure 2
   [56]Open in a new tab

   tSNE plot with the expression of different known cell markers

   The expressions of these genes help to characterize distinct major cell
   types.

Figure 3.

   [57]Figure 3
   [58]Open in a new tab

   tSNE visualization of different subsets of lymphoid cells, colored
   according to cellular identity, split by conditions

   Uninfected, n = 2; infected D50, n = 3; infected D100, n = 3.

Transcriptional differences among different conditions

     Inline graphic Timing: ∼ 4 h

   These steps help researchers to perform differential analysis of
   samples from different conditions. This section also discusses the
   process pathway enrichment analysis using PANTHER.[59]^5

     Note: In this study, we want to identify the transcriptional changes
     in the mouse lung due to the Mtb infection.

     * 7.
       Differential analysis.

   Use Findmarkers from Seurat for the differential analysis with default
   parameters. Change the assay to “RNA” before performing the analysis.

     Note: In this study, we compare infected vs uninfected mice
     separately for d50 and d100 (infected d50 vs uninfected and infected
     d50 vs uninfected), and for each cluster at a time point.

   # 7: Differential analysis to compare infected vs uninfected mice

   lymphoid_cluster$condition.DE <- paste(Idents(lymphoid_cluster),
   lymphoid_cluster$condition, sep = "_")

   lymphoid_cluster$condition <- Idents(lymphoid_cluster)

   Idents(lymphoid_cluster) <- "condition.DE"

   total_clusters<-18

   for (i in 0:(total_clusters-1)){

    rm(infected_d50_vs_uninfected)

    rm(infected_d100_vs_uninfected)

    infected_d50_vs_uninfected<- FindMarkers(lymphoid_cluster, ident.1 =
   paste0(i,"_infec_d50"), ident.2 = paste0(i,"_uninfected"), verbose =
   FALSE)

    infected_d100_vs_uninfected<- FindMarkers(lymphoid_cluster, ident.1 =
   paste0(i,"_infec_d100"), ident.2 = paste0(i,"_uninfected"), verbose =
   FALSE)

    infected_d50_vs_uninfected<- subset(infected_d50_vs_uninfected,
   p_val_adj<0.05)

    infected_d100_vs_uninfected<- subset(infected_d100_vs_uninfected,
   p_val_adj<0.05)

   write.csv(infected_d50_vs_uninfected,paste0("./DE_cluster_",i,"_infecte
   d_d50_vs_uninfected.csv"), row.names = FALSE)

   write.csv(infected_d100_vs_uninfected,paste0("./DE_cluster_",i,"_infect
   ed_d100_vs_uninfected.csv"), row.names = FALSE)

   }
     * 8.
       Pathway enrichment analysis.

   Perform over-representation test using the PANTHER website.[60]^5
   PANTHER is a web-based tool where user can provide the differential
   gene list, choose the parameters and the software enables the analysis.
   This need to be done separately for each gene list (differentially
   expressed genes) and up/down regulated.

     Note: We use the Reactome pathway database with Fisher’s exact test
     and corrected for multiple tests using a false discovery rate with
     default parameter settings as calculated by the Benjamini-Hochberg
     procedure.[61]^6

Expected outcomes

   This protocol guides the process of analyzing scRNA-seq data from mouse
   lungs to perform cluster analysis, identify enriched genes, cell type
   identification of each cluster, and differential analysis among
   different conditions in each cluster followed by identification of
   enriched pathways.

   Using this protocol, one can identify the major types of lymphoid
   cells, for example, Cd4^+/Cd8^+ T cells, B cells, NK cells and
   different sub-types of T/B (memory, naïve, activated) cells.

Limitations

   This study has some limitations. For example, this protocol focuses
   mainly on the lymphoid cells. Most of the cells were lymphoid cells due
   to the prior purification for live cells, which resulted in the
   enrichment of lymphoid populations to ∼96% of the total cells; only
   2.7% were myeloid and 0.82% were non-immune cells.[62]^1 As such we
   have not discussed other type of cells and related markers. Further, we
   use IntegrateData from Seurat to integrate samples, which requires
   intensive memory for large genomes (e.g., human, monkey) and big sample
   size. In such cases, researchers need to use high-performance computing
   resources to perform the analysis.

Troubleshooting

Problem 1

   Some steps of the data processing might require a higher processor and
   memory.

Potential solution

   It is better to do the sample integration in high-performance computer
   if the data size is big and/or the genome is large. We have eight
   samples from three different conditions. Therefore, we use Seurat
   integration method, IntegrateData, which requires intensive memory.
   This method is suggested to use for batch effect correction and to
   perform comparative scRNA-seq analysis across different conditions.
   Alternatively, researchers can use Harmony.[63]^7

Problem 2

   Assigning cell sub-type is critical. No tool or software can solely
   identify all cell subtypes.

Potential solution

   It is better to investigate different databases as well as literature
   to identify cell sub-types. The latest peer reviewed articles on the
   same organism and same source of cells can be helpful to identify the
   cell identity. Also, it is better to determine the cell sub-types based
   on both the list of top marker genes and the expression of known
   markers. One can also apply existing tools to classify cell-types, such
   as Azimuth,[64]^8 SC3,[65]^9 CellAssign,[66]^10 and SingleR.[67]^11
   These tools usually use machine learning or other statistical methods
   to differentiate between cell types based on unique gene expression
   patterns. The choice of tool depends on researcher’s specific goals and
   available resources.

Problem 3

   This manuscript and its reference article focus on lung lymphocytes
   only and so it discusses about the known marker genes of lymphocytes.

Potential solution

   Since about 96% cells of the dataset used here are lymphoid cells due
   to method of isolation,[68]^1 we focus on the markers of lymphoid cells
   only. If you have other type of cells, such as myeloid, non-immune
   cells, you can look into resources such as Human cell atlas, databases
   like CellMarker,[69]^12 and PanglaoDB[70]^13 and latest peer reviewed
   research articles.

Problem 4

   A cluster might have cells that express mix type of cell markers, such
   as both lymphoid and myeloid.

Potential solution

   If you have any cluster that has cells expressing both lymphoid and
   myeloid markers, those could potentially be doublets. It is crucial to
   carefully examine these instances and exclude them from subsequent
   analyses.

Problem 5

   Sometimes user might face difficulty to load the data into R.

Potential solution

   Please ensure that the files are present in the directory from which
   user is attempting to access the files. It’s also important to verify
   the accuracy of the file and folder names.

Resource availability

Lead contact

   Further information and requests for resources should be directed to
   and will be fulfilled by the lead contact, Shabaana A. Khader
   ([71]khader@uchicago.edu).

Materials availability

   This study did not generate any new unique reagents.

Acknowledgments