This report presents a reproducible analysis pipeline for a single-cell RNA-seq (scRNA-seq) dataset. The workflow is adapted from existing analysis scripts (Ziemann Lab, sc_macrophage repository) and materials from Workshop 6 of the bioinformatics workshop series (Ziemann Lab, 2025). Differential expression was performed using established Bioconductor workflows (Love et al., 2014). Pathway enrichment analysis was performed using the fgsea Bioconductor package, which implements a fast preranked gene set enrichment algorithm (Sergushichev, 2016). Gene sets were obtained from the Reactome pathway database (Jassal et al., 2020).
In my current work, scRNA-seq is used to investigate cellular and transcriptional dynamics of HIV infection in primary macrophages. The dataset analysed here includes samples from multiple donors and experimental batches, with cells stratified by HIV/GFP expression status. The complexity of this experimental design necessitates a robust and reusable analysis pipeline to enable consistent data processing, exploration, and presentation.
The primary aim of this analysis is to identify transcriptional and pathway-level differences between monocyte-derived macrophages (MDM) and alveolar monocyte–derived macrophages (AlvMDM). In addition, the analysis seeks to characterise gene expression and pathway changes associated with HIV infection. Identifying key differentially expressed genes and enriched pathways provides insight into biological mechanisms underlying macrophage identity and infection-driven responses. For the purpose of this analysis we will be investigating transcriptional differences between mock- and productively-infected (referred to as ‘active’ for simplicity) MDM.
The pipeline performs the following key steps:
1. Import, checking, and cleaning of scRNA-seq data
2. Basic quality control to assess read counts and mitochondrial content
3. Principal component analysis to quantify within and between group variation
4. Pseudobulk differential expression analysis
5. Visualisation of key results, including gene- and pathway-level summaries
Note: data acquisition and demultiplexing prior to step 1 are not performed in this script see https://github.com/markziemann/sc_macrophage/blob/main/scanalysis.Rmd for SingleCellExperiment (SCE) object generation. To summarise; Multiplexed 10x Genomics scRNA-seq libraries were generated from mock and HIV-infected macrophages (MDM, N=4) and alveolar macrophages (AlvMDM, N=3), with pooled donors and hashtag oligonucleotides (HTOs) where applicable. Raw FASTQ files were aligned to a combined human-HIV reference using Cell Ranger to produce filtered gene expression matrices and whitelisted cell barcodes. HTO counts were quantified with CITE-seq, and donor identities were assigned via SNP-based demultiplexing using cellSNP-lite and vireo. Low-quality or ambiguous cells were filtered, and experimental conditions and donor identities were validated with HTO signals and HIV transcript abundance. This outputs a SingleCellExperiment (SCE) object with gene expression, HTO counts, donor assignments, and QC metadata.
This analysis is directly relevant to my current research, where scRNA-seq datasets are being generated across multiple experimental conditions. The development of a reproducible pipeline enables consistent re-analysis as new samples are added and supports the efficient generation of figures and tables for reports, presentations, and future publications.
This report is written as an R Markdown document and should be rendered from top to bottom.
Requirements:
- R (≥ 4.2; tested with R 4.3.3)
- Internet access (for downloading the Reactome GMT file)
- The file sce.Rds downloaded from: https://ziemann-lab.net/public/anna/macrophage_may2025/sce.Rds
To run:
1. Place sce.Rds in the same directory as this Rmd file.
2. Open the Rmd in RStudio.
3. Click Knit to render the HTML report.
Load packages
library(SingleCellExperiment)
library(dplyr)
library(scran)
library(scater)
library(DESeq2)
library(gplots)
library(kableExtra)
library(Matrix)
library(fgsea)
library(irlba)
library(ggplot2)
library(patchwork)
library(BiocParallel)
library(DT)
The scRNA-seq dataset was loaded into a SingleCellExperiment object.
sce <- readRDS("sce.Rds")
sce
## class: SingleCellExperiment
## dim: 36622 24311
## metadata(0):
## assays(2): counts logcounts
## rownames(36622): HIV-LTRR HIV-LTRU5 ... AC007325.4 AC007325.2
## rowData names(0):
## colnames(24311): mdm_mock1|AAACGAATCACATACG mdm_mock1|AAACGCTCATCAGCGC
## ... react6|TTTGTTGTCTGAACGT react6|TTTGTTGTCTTGGCTC
## colData names(13): orig.ident nCount_RNA ... monaco_fine_pruned_labels
## ident
## reducedDimNames(3): PCA HARMONY UMAP
## mainExpName: RNA
## altExpNames(0):
# Remove old QC columns if present
qc_cols <- grep("^subsets_Mt|^sum$|^detected$|^total$", colnames(colData(sce)), value = TRUE)
if(length(qc_cols) > 0) {
colData(sce)[, qc_cols] <- NULL
}
# Identify mitochondrial genes
is.mito <- grepl("^MT-", rownames(sce))
# Add per-cell QC metrics
sce <- addPerCellQC(sce, subsets = list(Mt = is.mito), assay.type = "counts")
# Create data frame for plotting
df <- as.data.frame(colData(sce))
Quality control metrics were visualised to assess gene detection rates and mitochondrial content across samples.
Note: alv preffix represents the AlvMDM cell type. Bystander, latent, and reactivated/‘react’ cell types are not being used in this example but exist in this dataset.
low_lib <- sce$sum < 200 # to remove empty droplets / low sequencng depth
high_lib <- sce$sum > 100000 # to remove doublets
low_detected <- sce$detected < 1500 # to remove low-quality cells and low numbers of detected genes, would consider making this more lenient if analysing latent cells
high_detected <- sce$detected > 8000 # remove very complex/doublet-like cells
high_mito <- sce$subsets_Mt_percent > 15 # fixed cutoff (standard)
#combine all filters
discard_cells <- low_lib | high_lib | low_detected | high_detected | high_mito
#subset
sce_sub <- sce[, !discard_cells]
# Single-cell gene-level metrics (ANNOTATION ONLY — no filtering)
counts_mat <- counts(sce_sub)
rowData(sce_sub)$sc_genes_detected <- rowSums(counts_mat > 0)
rowData(sce_sub)$sc_total_counts <- rowSums(counts_mat)
rowData(sce_sub)$sc_keep_gene <- rowData(sce_sub)$sc_genes_detected >= 4 &
rowData(sce_sub)$sc_total_counts >= 20
# Compute library size per cell if not already present
if(!"sum" %in% colnames(colData(sce_sub))) {
sce_sub$sum <- colSums(counts(sce_sub))
}
# Now compute total UMIs per demultiplexed sample
library_sizes <- tapply(
sce_sub$sum,
sce_sub$sample,
sum
)
The numbers refer to different donors within the dataset.
# Deconvolution-based size factors (robust for scRNA-seq)
register(MulticoreParam(2)) # use 2 cores
time_taken_deconvol <- system.time({
sce_sub <- computeSumFactors(sce_sub, BPPARAM=MulticoreParam(2)) # can take a while so use parallel computing
})
time_taken_deconvol # to check how long it takes to optimise fastest computation
sce_sub <- logNormCounts(sce_sub)
# Model gene variance
dec <- modelGeneVar(sce_sub)
# Select top 2000 HVGs
top_hvgs <- getTopHVGs(dec, n = 2000)
length(top_hvgs)
# Subset logcounts to HVGs for PCA
log_cpm <- logcounts(sce_sub)[top_hvgs, ]
# Transpose & scale
x_scaled <- scale(t(logcounts(sce_sub)[top_hvgs, ]))
## user system elapsed
## 1.982 0.271 2.261
Principal component analysis revealed separation between samples consistent with infection status, indicating structured transcriptional variation.
# Exclude HIV genes
host_genes <- !grepl("^HIV-", rownames(counts(sce_sub)))
sce_host <- sce_sub[host_genes, ]
# Subset metadata & counts to your 8 groups of interest
keep_groups <- c("mdm_mock1","mdm_mock2","mdm_mock3","mdm_mock4",
"mdm_active1","mdm_active2","mdm_active3","mdm_active4")
meta <- as.data.frame(colData(sce_host))
meta_sub <- meta[meta$sample %in% keep_groups, , drop=FALSE]
counts_sub <- counts(sce_host)[, rownames(meta_sub)]
meta_sub$sample <- factor(meta_sub$sample)
# Pseudobulk aggregation (manual, similar to your rowsum)
pb_mat <- t(rowsum(t(counts_sub), group = meta_sub$sample))
# Pseudobulk-aware gene filtering for DESeq2
keep <- rowSums(pb_mat >= 10) >= 2
pb_mat <- pb_mat[keep, ]
dim(pb_mat)
## [1] 15576 8
summary(pb_mat)
## mdm_active1 mdm_active2 mdm_active3 mdm_active4
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 46 1st Qu.: 34 1st Qu.: 18.0 1st Qu.: 21.0
## Median : 223 Median : 173 Median : 95.0 Median : 110.0
## Mean : 1836 Mean : 1327 Mean : 701.4 Mean : 875.3
## 3rd Qu.: 787 3rd Qu.: 603 3rd Qu.: 320.0 3rd Qu.: 388.0
## Max. :4383127 Max. :2637039 Max. :1762057.0 Max. :1675093.0
## mdm_mock1 mdm_mock2 mdm_mock3 mdm_mock4
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 45.0 1st Qu.: 34 1st Qu.: 8.0 1st Qu.: 34.0
## Median : 231.0 Median : 174 Median : 41.0 Median : 179.0
## Mean : 1813.6 Mean : 1242 Mean : 305.6 Mean : 1321.3
## 3rd Qu.: 810.2 3rd Qu.: 601 3rd Qu.: 138.0 3rd Qu.: 610.2
## Max. :2831345.0 Max. :1343859 Max. :506522.0 Max. :1647158.0
# DESeq2 metadata
colData_pb <- data.frame(row.names = colnames(pb_mat),
sample = factor(gsub("[0-9]+$", "", colnames(pb_mat)),
levels = c("mdm_mock", "mdm_active"))) # Set mock as reference
dds <- DESeqDataSetFromMatrix(countData = pb_mat, colData = colData_pb, design = ~ sample)
dds <- DESeq(dds)
res <- results(dds)
res$sc_genes_detected <- rowData(sce_sub)[rownames(res), "sc_genes_detected"]
res$sc_total_counts <- rowData(sce_sub)[rownames(res), "sc_total_counts"]
# Variance stabilising transform for heatmaps
vsd <- vst(dds, blind = FALSE)
# Combine DE results & normalized counts for ranking / heatmaps
dge <- cbind(as.data.frame(res), assay(vsd))
dge <- dge[order(dge$pvalue), ]
dge_dt <- as.data.frame(dge)[, c(
"log2FoldChange",
"lfcSE",
"stat",
"pvalue",
"padj"
)]
dge_dt$sc_genes_detected <- rowData(sce_sub)[rownames(dge_dt), "sc_genes_detected"]
dge_dt <- dge_dt[!is.na(dge_dt$pvalue), ]
# Automatically round all numeric columns to 3 significant figures
dge_dt <- dge_dt %>%
dplyr::mutate(across(where(is.numeric), ~ signif(.x, 3)))
Top Differentially Expressed Genes in MDMs Following Infection
CORIN (log2FC = 4.15, padj = 0.047, detected in 130 cells) and HES1 (log2FC = 3.19, padj = 0.0058, detected in 1629 cells) were among the most upregulated genes. CORIN encodes a serine protease involved in cardiovascular homeostasis, which may reflect activation of stress-response pathways in our infected macrophages. HES1 is a transcriptional repressor downstream of Notch signaling and has been implicated in macrophage differentiation and inflammatory responses, suggesting a shift toward an activated immune phenotype.
MYBL2 (log2FC = -5.23, padj = 0.00019, detected in 384 cells) and RRM2 (log2FC = -4.26, padj = 0.00198, detected in 365 cells) were among the most downregulated genes. MYBL2 is a transcription factor critical for cell cycle progression, and its suppression may reflect a slowdown of macrophage proliferation in response to infection. RRM2 encodes a subunit of ribonucleotide reductase required for DNA synthesis, suggesting a reduction in DNA replication and repair activity.
Differentially expressed genes were visualised using a heatmap of the top ranked genes across experimental groups.
Heatmap visualization of the top 50 differentially expressed genes revealed clear segregation between mock and HIV-infected macrophages. Genes such as CORIN and HES1 were strongly upregulated in infected cells, while MYBL2 and RRM2 were downregulated. Interestingly, most of these top 50 genes appear to be downregulated in the MDM active group when compared to mock, suggesting that the virus may be actively suppressing host gene expression to subvert cellular defenses and favor viral replication. Clustering of both genes and samples highlights coordinated transcriptional responses, suggesting activation of immune and stress-response pathways upon infection. This pattern is consistent across donors, indicating reproducible transcriptional changes associated with infection status.
# Ranked gene list for fgsea
stat <- res$stat
names(stat) <- rownames(res)
stat[is.na(stat)] <- 0
# Download GMT file if missing
GMT <- "ReactomePathways_2025-10-15.gmt"
if (!file.exists(GMT)) {
download.file("https://ziemann-lab.net/public/bioinfo_workshop/ReactomePathways_2025-10-15.gmt", destfile=GMT)
}
gs <- gmtPathways(GMT)
# Run fgsea
fres <- fgsea(pathways=gs, stats=stat, minSize=5, nproc=4)
## | | | 0% | | | 1% | |= | 1% | |= | 2% | |== | 2% | |== | 3% | |=== | 4% | |=== | 5% | |==== | 6% | |===== | 7% | |====== | 8% | |====== | 9% | |======= | 10% | |======== | 11% | |======== | 12% | |========= | 12% | |========= | 13% | |========== | 14% | |========== | 15% | |=========== | 15% | |=========== | 16% | |============ | 17% | |============= | 18% | |============= | 19% | |============== | 20% | |=============== | 21% | |=============== | 22% | |================ | 22% | |================ | 23% | |================= | 24% | |================= | 25% | |================== | 25% | |================== | 26% | |=================== | 27% | |=================== | 28% | |==================== | 28% | |==================== | 29% | |===================== | 30% | |====================== | 31% | |====================== | 32% | |======================= | 33% | |======================== | 34% | |======================== | 35% | |========================= | 35% | |========================= | 36% | |========================== | 37% | |========================== | 38% | |=========================== | 38% | |=========================== | 39% | |============================ | 40% | |============================= | 41% | |============================= | 42% | |============================== | 43% | |=============================== | 44% | |================================ | 45% | |================================ | 46% | |================================= | 47% | |================================= | 48% | |================================== | 48% | |================================== | 49% | |=================================== | 49% | |=================================== | 50% | |=================================== | 51% | |==================================== | 51% | |==================================== | 52% | |===================================== | 52% | |===================================== | 53% | |====================================== | 54% | |====================================== | 55% | |======================================= | 56% | |======================================== | 57% | |========================================= | 58% | |========================================= | 59% | |========================================== | 60% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 62% | |============================================ | 63% | |============================================= | 64% | |============================================= | 65% | |============================================== | 65% | |============================================== | 66% | |=============================================== | 67% | |================================================ | 68% | |================================================ | 69% | |================================================= | 70% | |================================================== | 71% | |================================================== | 72% | |=================================================== | 72% | |=================================================== | 73% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 75% | |===================================================== | 76% | |====================================================== | 77% | |====================================================== | 78% | |======================================================= | 78% | |======================================================= | 79% | |======================================================== | 80% | |========================================================= | 81% | |========================================================= | 82% | |========================================================== | 83% | |=========================================================== | 84% | |=========================================================== | 85% | |============================================================ | 85% | |============================================================ | 86% | |============================================================= | 87% | |============================================================= | 88% | |============================================================== | 88% | |============================================================== | 89% | |=============================================================== | 90% | |================================================================ | 91% | |================================================================ | 92% | |================================================================= | 93% | |================================================================== | 94% | |=================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 97% | |==================================================================== | 98% | |===================================================================== | 98% | |===================================================================== | 99% | |======================================================================| 99% | |======================================================================| 100%
fsig <- subset(fres, padj<0.05)
# Create data frame from fgsea results
fgsea_dt <- as.data.frame(fres)
fgsea_dt$abs_ES <- abs(fgsea_dt$ES)
fgsea_dt <- fgsea_dt[, c("pathway", "NES", "size", "pval", "padj", "abs_ES")]
# Round to 3 significant figures
fgsea_dt <- fgsea_dt %>%
dplyr::mutate(across(where(is.numeric), ~ signif(.x, 3)))
## png
## 2
Pathway Enrichment Analysis
The top pathways with adjusted p-values < 0.05 and the highest absolute enrichment scores revealed biologically meaningful patterns. Among de-enriched pathways, in active MDM compared to mock, “Eicosanoid ligand-binding receptors” which includes genes such as PTGER2 and LTB4R that are involved in inflammatory signaling, and “G2 Phase” and “G2/M DNA replication checkpoint” which contains genes CDK1 and CCNB1, highlighting a reduction in cell cycle progression in the active MDMs. Conversely, enriched pathways included “Acyl chain remodelling of PI,” featuring AGPAT2 and LIPIN1 which are relevant for phospholipid metabolism, and “Anchoring fibril formation,” including COL7A1 and LAMB3, suggesting extracellular matrix remodeling. These findings support our project aims by suggesting that infection triggers a shift in MDMs from proliferative states toward altered lipid metabolism and extracellular matrix interactions, consistent with functional reprogramming during activation.
Overall, the differential expression and pathway enrichment analyses reveal clear transcriptional differences between mock and HIV-active MDMs, with multiple immune-, cell cycle-, and metabolism-related pathways showing significant modulation. These findings directly address the aims of this experiment by identifying key genes and biological pathways associated with macrophage activation and HIV infection, demonstrating that the analytical workflow successfully captures infection-driven transcriptional reprogramming.
The bioinformatics workshop series provided a structured progression from basic data skills to advanced analysis workflows that significantly broadened my understanding of programming and its application to biological data analysis. Early sessions on the Linux command line, R data structures, and R Markdown reports were particularly useful in developing a reproducible workflow for single-cell RNA-seq data. Intermediate R programming skills and hands-on omics analysis directly informed the pipeline presented here. Concepts such as differential expression, PCA, and pathway enrichment, although introduced in the context of bulk RNA-seq, were directly adaptable to my current work. The later sessions on workflow management, HPC usage, machine learning, and proteomics analysis, while not directly applied here, offered useful context and highlighted the versatility of programming tools in addressing diverse biological questions. Overall, the workshops increased my confidence in structuring reproducible, shareable analyses and adapting example code to new biological questions.
The series also strengthened my confidence in independently structuring analyses, documenting code, and producing reproducible results. I now feel equipped to tackle lab-based questions and generate analyses that can be shared or revisited as new data are added.
The workshops covered a wide range of topics, which sometimes made it challenging to consolidate learning before moving to the next focus. Nevertheless, this structure ultimately encouraged adaptability, prompting me to actively integrate and revisit ideas. Exposure to a wide range of tools and approaches has inspired me to explore novel methods and incorporate them into existing pipelines, enhancing both analytical depth and workflow efficiency. If possible for future iterations, having a few consecutive sessions on a similar theme could help participants reinforce key concepts and deepen understanding. Additionally, larger attendance or opportunities for peer interaction, such as in-person buddy systems, could support progress checks and collaborative learning if it is available.
Ziemann M. sc_macrophage: Single-cell macrophage RNA-seq analysis workflow. GitHub repository. https://github.com/markziemann/sc_macrophage.
Ziemann Lab (2024). Bioinformatics Workshop 6: RNA-seq analysis. https://ziemann-lab.net/public/bioinfo_workshop/workshop6.html.
Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology.
Sergushichev A. fgsea: Fast Gene Set Enrichment Analysis. Bioconductor package. 2016. https://bioconductor.org/packages/fgsea
Jassal B, et al. The Reactome pathway knowledgebase. Nucleic Acids Research. 2020;48(D1):D498–D503.
For reproducibility.
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 LAPACK version 3.10.0
##
## locale:
## [1] C
##
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DT_0.34.0 BiocParallel_1.43.3
## [3] patchwork_1.3.2 irlba_2.3.5.1
## [5] fgsea_1.35.2 Matrix_1.7-4
## [7] kableExtra_1.4.0 gplots_3.2.0
## [9] DESeq2_1.49.2 scater_1.37.0
## [11] ggplot2_4.0.1 scran_1.38.0
## [13] scuttle_1.19.0 dplyr_1.1.4
## [15] SingleCellExperiment_1.31.0 SummarizedExperiment_1.39.0
## [17] Biobase_2.69.0 GenomicRanges_1.61.0
## [19] GenomeInfoDb_1.45.4 IRanges_2.43.0
## [21] S4Vectors_0.47.0 BiocGenerics_0.55.0
## [23] generics_0.1.4 MatrixGenerics_1.21.0
## [25] matrixStats_1.5.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 gridExtra_2.3 rlang_1.1.6
## [4] magrittr_2.0.4 compiler_4.5.2 systemfonts_1.3.1
## [7] vctrs_0.6.5 stringr_1.6.0 pkgconfig_2.0.3
## [10] crayon_1.5.3 fastmap_1.2.0 XVector_0.49.0
## [13] labeling_0.4.3 caTools_1.18.3 rmarkdown_2.30
## [16] UCSC.utils_1.5.0 ggbeeswarm_0.7.2 xfun_0.54
## [19] bluster_1.20.0 cachem_1.1.0 beachmat_2.25.1
## [22] jsonlite_2.0.0 DelayedArray_0.35.1 parallel_4.5.2
## [25] cluster_2.1.8.1 R6_2.6.1 bslib_0.9.0
## [28] stringi_1.8.7 RColorBrewer_1.1-3 limma_3.65.1
## [31] jquerylib_0.1.4 Rcpp_1.1.0 knitr_1.50
## [34] igraph_2.2.1 tidyselect_1.2.1 rstudioapi_0.17.1
## [37] dichromat_2.0-0.1 abind_1.4-8 yaml_2.3.10
## [40] viridis_0.6.5 codetools_0.2-20 lattice_0.22-7
## [43] tibble_3.3.0 withr_3.0.2 S7_0.2.1
## [46] evaluate_1.0.5 xml2_1.5.0 pillar_1.11.1
## [49] KernSmooth_2.23-26 scales_1.4.0 gtools_3.9.5
## [52] glue_1.8.0 metapod_1.18.0 tools_4.5.2
## [55] BiocNeighbors_2.3.1 data.table_1.17.8 ScaledMatrix_1.17.0
## [58] locfit_1.5-9.12 fastmatch_1.1-6 cowplot_1.2.0
## [61] grid_4.5.2 crosstalk_1.2.2 edgeR_4.7.2
## [64] beeswarm_0.4.0 BiocSingular_1.25.0 vipor_0.4.7
## [67] cli_3.6.5 rsvd_1.0.5 textshaping_1.0.4
## [70] S4Arrays_1.9.1 viridisLite_0.4.2 svglite_2.2.2
## [73] gtable_0.3.6 sass_0.4.10 digest_0.6.39
## [76] SparseArray_1.9.0 ggrepel_0.9.6 dqrng_0.4.1
## [79] htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.8.1
## [82] lifecycle_1.0.4 httr_1.4.7 statmod_1.5.1