Abstract
Single-cell assay for transposase-accessible chromatin using sequencing
(scATAC-seq) is being increasingly used to study gene regulation.
However, major analytical gaps limit its utility in studying gene
regulatory programs in complex diseases. In response, MOCHA
(Model-based single cell Open CHromatin Analysis) presents major
advances over existing analysis tools, including: 1) improving
identification of sample-specific open chromatin, 2) statistical
modeling of technical drop-out with zero-inflated methods, 3)
mitigation of false positives in single cell analysis, 4)
identification of alternative transcription-starting-site regulation,
and 5) modules for inferring temporal gene regulatory networks from
longitudinal data. These advances, in addition to open chromatin
analyses, provide a robust framework after quality control and cell
labeling to study gene regulatory programs in human disease. We
benchmark MOCHA with four state-of-the-art tools to demonstrate its
advances. We also construct cross-sectional and longitudinal gene
regulatory networks, identifying potential mechanisms of COVID-19
response. MOCHA provides researchers with a robust analytical tool for
functional genomic inference from scATAC-seq data.
Subject terms: Gene regulatory networks, Data processing, Statistical
methods, Software, Biotechnology
__________________________________________________________________
Analytical gaps limit the utility of scATAC-seq for studying gene
regulatory programs in human disease. Here, authors describe MOCHA, a
robust analytical tool with advanced statistical modelling that enables
functional genomic inference in large cross-sectional and longitudinal
human studies.
Introduction
Single-cell assay for transposase-accessible chromatin using sequencing
(scATAC-seq)^[46]1–[47]3 has become increasingly popular in recent
years for studying biological and translational questions around gene
regulation and cell identity and has revealed insights on diverse
topics such as tumor-related T cell exhaustion^[48]4, trained immunity
in monocytes in patients with COVID-19^[49]5, regulators of innate
immunity in COVID-19^[50]6, and potentially causal variants for
Alzheimer’s disease and Parkinson’s disease^[51]7. Many sophisticated
software tools have been developed for analyzing scATAC-seq data^[52]8,
covering functionalities such as dimensionality reduction and
clustering^[53]9–[54]11, semi-automated cell type
annotation^[55]10,[56]11, identification of open chromatin
regions^[57]12–[58]16, characterization of motif usage and
enrichment^[59]17–[60]20, and inference on gene regulatory
networks^[61]11,[62]21,[63]22. Integrating these developments, recent
end-to-end analysis pipelines streamline the analytical process from
quality control and cell type annotation to accessibility and motif
analysis^[64]9–[65]12,[66]23. Together these tools have facilitated the
extraction of biological insights from scATAC-seq data.
Despite these advances, major analytical gaps in scATAC-seq data
analysis limit the construction of robust and reproducible gene
regulatory networks to study human disease. First, human disease
studies require reliable evaluation of sample- and cell-type-specific
open chromatin to capture human genetic heterogeneity and
cell-type-specific regulatory regions. However, visibility into these
forms of heterogeneity is compromised by existing
packages^[67]10–[68]13,[69]15, which usually mix cells across either
samples or cell types to compensate for the low coverage of scATAC-seq.
Second, scATAC-seq data is intrinsically sparse. Only 5–15% of open
chromatin regions are detected in individual cells^[70]9. Both
single-cell and pseudo-bulk ATAC-seq data can contain an excessively
high number of regions without accessibility measurements. While
zero-inflated (ZI) statistical methods are widely used and often
debated in analyzing single-cell ribonucleic acid sequencing
(scRNA-seq) data^[71]24–[72]28, such methods are not implemented in
popular tools for scATAC-seq data analysis, likely leading to many
unreliable results. Third, previous studies have shown that
pseudo-replication bias (cell-interdependence) generates many false
results in scRNA-seq analysis, if left unaddressed^[73]29,[74]30.
Similarly, existing scATAC-seq tools, which use the Wald, Wilcoxon, and
logistic regression tests at the single-cell level^[75]10,[76]31, do
not address this issue and thus may generate many false results as
well. Additionally, others have applied methods designed for bulk
RNA-seq (such as DESeq)^[77]32. The few existing reviews describe the
methods mentioned above, but do not benchmark them
comprehensively^[78]8,[79]31. In longitudinal studies, these challenges
are exacerbated as subjects have multiple interdependent samples. To
postulate robust gene regulatory networks in human disease, the
research community needs a tool to address these challenges.
To this end, we developed a suite of analytical modules for robust
functional genomic inference in heterogeneous human disease cohorts, in
an open R package called MOCHA (Model-based single-cell Open CHromatin
Analysis). First, we developed a method for evaluating sample- and
cell-type-specific chromatin accessibility in low coverage scATAC-seq.
Second, we implemented ZI statistical methods^[80]33–[81]37 for
differential accessibility analysis, co-accessibility analysis, and
mixed effects modeling. Third, we aggregated scATAC-seq data per cell
type into normalized pseudo-bulk tile-sample accessibility matrices
(TSAMs) to capture donor and cell type-centric accessibility. This TSAM
directly addresses the issue of cell interdependence
(pseudo-replication) and enables modeling cross-sectional and
longitudinal gene regulatory landscapes of human disease in large
cohorts. We benchmarked MOCHA against state-of-the-art methods in
identifying regions of open chromatin, differential accessibility, and
co-accessibility. More importantly, we demonstrate MOCHA’s ability to
construct gene regulatory networks from both cross-sectional and
longitudinal analyses of scATAC-seq data on CD16 monocytes from our
COVID-19 cohort^[82]38. We also demonstrate how to integrate MOCHA with
existing tools^[83]11,[84]17,[85]33,[86]39, contrast its functionality
with other existing tools, and adapt it for custom approaches. In all,
we anticipate MOCHA will accelerate the analysis and interpretation of
gene regulatory networks using scATAC-seq data.
Results
MOCHA overview
We developed MOCHA based on our COVID19 dataset (Methods), which was
collected on n = 91 peripheral blood mononuclear cell (PBMC) samples of
either COVID+ participants (n = 18, 10 females and 8 males, 3–5 samples
per participant, a total of 69 samples) or uninfected COVID-
participants (n = 22, 10 females and 12 males, one sample per
participant). We obtained high-quality scATAC-seq data of 1,311,638
cells from the samples. Unless specified, we mainly used a
cross-sectional subset of the COVID19 dataset (denoted as COVID19X,
n = 39) in MOCHA’s development, including data of the COVID- samples
and the first samples of the COVID+ participants during early infection
(<16 days post symptom onset (PSO), n = 17, 9 females and 8 males).
We designed MOCHA to serve as an analytical framework for
sample-centric scATAC-seq data analysis, after quality control
assessments (filtering duplicate fragments and low quality cells), cell
type labeling, and doublet removal. MOCHA identifies sample- and cell
type-specific open chromatin and provides a range of analytical
functions for complex scATAC-seq data analysis (Fig. [87]1 and
Supplementary Fig. [88]1):
Fig. 1. General workflow of MOCHA.
[89]Fig. 1
[90]Open in a new tab
Schematic representation of the core functionalities in MOCHA, starting
from scATAC input data (fragments, black list, cell type labels, and
sample metadata). Using these data, MOCHA generates fragment counts for
every 500 bp tiles (1), normalizes the count data (2), and leverages
single-cell and pseudo-bulk information to identify open tiles in a
cell type- and sample-specific manner (3). It then generates
population-level open chromatin matrices for each cell type (4), which
is the starting point for downstream analytical functions (5). MOCHA
includes improvements to differential accessibility analysis,
co-accessibility analysis, and longitudinal modeling. It also provides
functions for identifying alternatively regulated transcription
starting sites, motif enrichment, and dimensionality reduction. Panels
were generated using Adobe Illustrator. Panel 1 also used BioRender,
released under a Creative Commons Attribution-NonCommercial-NoDerivs
4.0 International license.
(i) Tiling: MOCHA divides the genome into pre-defined 500 base pair
(bp) tiles, which allows head-to-head comparisons on chromatin
accessibility across samples and cell types and avoids complex
peak-merging procedure on non-aligned summits^[91]9,[92]11.
(ii) Normalization: Since sequencing depth may differ across samples in
a large-scale scATAC-seq study (Supplementary Fig. [93]2a), it is
essential to normalize scATAC-seq data prior to meaningful
accessibility analysis. MOCHA counts the number of fragments that
overlap with individual tiles in individual cells, collects the total
and the maximum fragment counts for each tile from all cells of a
targeted cell type per sample, and normalizes the fragment counts by
the total number of fragments of the cell type per sample to reduce the
effects of variations in sequencing depth and cell count (Methods).
Compared with other normalization approaches, this approach resulted in
the lowest coefficient of variation (CV) distribution on 2230 invariant
CCCTC-binding factor (CTCF) sites on the COVID19 dataset (Supplementary
Fig. [94]2b, n = 91). Additionally, this normalization implicitly
addresses batch effects around sequencing depth and cell numbers, the
effects of which may not be evident in the single cell space (Methods,
Supplementary Fig. [95]3). As indicated by the low to moderate CV
values, this approach also makes it possible to compare normalized
accessibility across samples (Supplementary Fig. [96]2b) and cell types
(Supplementary Fig. [97]2c).
(iii) Accessibility evaluation: MOCHA applies logistic regression
models (LRMs) to evaluate whether a tile in a given cell type and
sample is accessible based on three parameters (Methods): the
normalized total fragment count λ^(1), the normalized maximum fragment
count λ^(2), and a study-specific prefactor
[MATH: S :MATH]
to account for differences in data quality between training and user
datasets. The global prefactor, S, is needed for a broad application of
MOCHA as data quality, sequencing depth, and cell count may result in a
large difference in median number of fragments per cell either between
different studies (e.g., 3600 to 9000 fragments per cell, Supplementary
Fig. [98]4a, b) or across cell types in different organs (e.g., 4k −
25k fragments per cell, Supplementary Fig. [99]5). Since
[MATH:
λ(2) :MATH]
can only be evaluated on scATAC-seq data, its usage distinguishes MOCHA
from peak-calling methods based on pseudo-bulk ATAC-seq data only.
Using the full COVID19 dataset (n = 91), we generated pseudo-bulk
ATAC-seq data from scATAC-seq data of all cells of a targeted cell
type, ran MACS2^[100]40 to identify all accessible regions, and used
these labels as the imperfect “ground truth” to train and test the
LRMs. More specifically, we used natural killer (NK) cells
(n = 179,836) for training, which had 750 million total fragments, or
15x the recommended coverage for MACS2^[101]41,[102]42. On this
dataset, MACS2 identified 1.15 million tiles as accessible and the
remaining 4.39 million tiles having fragments were labeled as
‘inaccessible’. Using these tiles as positives and negatives,
respectively, we developed cell count-specific LRMs with varying
coefficient to account for variations in cell count across samples
(Supplementary Fig. [103]6a-b). The Youden Index^[104]43 was used to
balance the tradeoff between sensitivity and specificity (Supplementary
Fig. [105]6b) and generate cell count-dependent thresholds. MOCHA
applies smoothing and interpolation to find the proper coefficients and
thresholds for cell counts not in the training. Once trained, all
coefficients and thresholds for the LRMs are fixed for new datasets and
applied to all cell types, with the exception for S which needs to be
adjusted to account for difference in fragments per cell in new
datasets.
The LRM model was trained exclusively on NK cells in the full COVID19
dataset and then assessed on either other cell types in the dataset
(using the same S) or other human or mouse datasets (using S evaluated
from the corresponding studies). We validated the LRMs (Supplementary
Fig. [106]6c, d) using data of CD14 monocytes (n = 135,949), naive B
cells (n = 60,595), CD16 monocytes (n = 28,525), NK CD56 bright cells
(n = 14,692), and conventional dendritic cells (cDCs, n = 9915). We
used sensitivity, specificity, area under the receiver operating
characteristic (ROC) curve (AUC), and Youden’s J statistic to quantify
the performance (Supplementary Fig. [107]6c, d). Overall MOCHA had a
good performance even at low cell counts. For example, MOCHA achieved
an AUC ranging from 0.693 (CD14 monocytes), 0.703 (CD16 monocytes), to
0.741 (NK CD56 bright cells) with only 50 cells.
(iv) Tile-sample accessibility matrix (TSAM): MOCHA first uses the LRMs
to identify accessible tiles from cells of a targeted cell type in
individual samples and then keeps only tiles that are common to at
least a user-defined fraction threshold of samples. Afterwards, MOCHA
generates a TSAM for the cell type with rows being the accessible
tiles, columns the samples, and elements the corresponding λ^(1)
values. While λ^(2) is informative for identifying sample-specific
accessibility (Supplementary Fig. [108]7), λ^(1) provides a more robust
assessment on chromatin accessibility and was chosen for cross-sample
comparisons and downstream analyses. A total of 215,649 accessible
tiles were identified on CD16 monocytes with a fraction threshold of
20% (Supplementary Fig. [109]2d) across either COVID+ or COVID- samples
in the COVID19X dataset (n = 39). About 25% elements in the obtained
TSAM were zero (Supplementary Fig. [110]2e, f), reflecting the sparsity
of scATAC-seq data even after pseudo-bulking. Statistical testing
against negative binomial (NB) distributions revealed the data was ZI
(Supplementary Fig. [111]18c), indicating the necessity of applying ZI
statistical methods for downstream analysis.
(v) Differential accessibility analysis (DAA): Similar to other
methods, MOCHA first filters out noisy tiles. Tiles are excluded if
either 1) the median log[2](λ[j]^(1) + 1) value (across all samples) is
lower than a user-defined threshold or 2) their difference (between two
sample groups) in percentage of zeros is less than a user-defined
threshold. Unlike other methods, MOCHA includes functions to
heuristically define these thresholds. For the COVID19X dataset
(n = 39), we noticed that the log[2](λ[j]^(1) + 1) values in the TSAM
of CD16 monocytes followed a bi-modal model and thus chose a value of
12 near the higher mode as the median accessibility threshold
(Supplementary Fig. [112]2g). Additionally, we observed a 25%
difference in fragment counts between COVID+ and COVID- samples
(Supplementary Fig. [113]2a), so we set a 50% threshold for the
difference in the percentage of zeros to control for technical
differences. MOCHA then applies a two-part Wilcoxon
test^[114]34,[115]35 to identify differential accessibility tiles
(DATs) within the cell type between the two sample groups (Methods).
DATs have either a large fold change (FC) in accessibility
(Supplementary Fig. [116]2h) and/or large difference in percentage of
zeros (Supplementary Fig. [117]2i). For comparison, ArchR^[118]11 uses
the standard Mann-Whitney-Wilcoxon (MWW) test along with a post-test
log[2](FC) cutoff to identify differential regions on bias-matched cell
populations, while Signac^[119]10 constructs LRMs and prioritizes
differential regions based on FC.
(vi) Co-accessibility analysis (CAA): MOCHA applies the ZI Spearman
correlation^[120]36 to evaluate two types of co-accessibility between
tiles in TSAMs (“Methods” section). The inter-cell type
co-accessibility is evaluated across cell types where data from
different samples are stacked. The inter-sample co-accessibility is
evaluated within a targeted cell type but across different samples.
Both types are important to infer potential gene regulatory networks,
one for understanding differences between cell types and the other for
understanding differences between sample groups.
In addition, MOCHA has functions for dimensionality reduction, motif
enrichment, analysis for alternative transcription starting site (TSS)
regulation, and longitudinal modeling. TSAMs can also be passed to some
existing scATAC-seq tools such as ArchR and chromVAR and other
bioinformatics tools such as Monocle3 for further analysis.
Furthermore, users can easily leverage information from TSAMs to
conduct their own interrogations of scATAC-seq data. MOCHA’s
functionalities are mostly complementary to those of PALMO^[121]44, a
previously published platform by our group for analyzing longitudinal
omics data (Supplementary Table [122]1). MOCHA can run on a standard
desktop or modern laptop computer with at minimum 2GB of RAM, 1 Core
CPU 2.0 GHz/core. For best performance, at least 8GB of RAM and 4 + CPU
cores are recommended to take advantage of MOCHA’s parallelization.
MOCHA reliably detects sample-specific chromatin accessibility
A crucial component of scATAC-seq data analysis is to reliably detect
which chromatin regions are accessible. We benchmarked MOCHA against
the popular tools MACS2^[123]40 and HOMER^[124]45. The former is also
implemented in ArchR^[125]11, Signac^[126]10, and SnapATAC^[127]9. We
compared these tools using three scATAC-seq datasets with different
data quality and sequencing depth (“Methods” section and Supplementary
Fig. [128]6a): (i) COVID19X (n = 39, Fig. [129]2) or COVID19 (n = 91,
Supplementary Fig. [130]6); (ii) HealthyDonor, a dataset of 18 PBMC
samples of n = 4 healthy donors^[131]44; and (iii) Hematopoiesis, an
assembled dataset of hematopoietic cells from 49 samples of diverse
data quality^[132]11, which was treated as a single sample in this
study. To ensure a fair comparison and remove artifacts due to tiling,
we trimmed the broad peaks by MACS2 or HOMER and removed the tail ends
of peaks that extended onto tiles with no signal (“Methods” section).
Three representative cell types with moderate to high cell counts were
selected from each of the three datasets for the comparison, with cell
count per sample ranging from 227 to 743 (COVID19X, median), 163 to 784
(HealthyDonor, median), and 1175 to 27463 (Hematopoiesis), respectively
(Fig. [133]2a and Supplementary Fig. [134]6b).
Fig. 2. Benchmarking MOCHA with MACS2 and HOMER on open chromatin
identification.
[135]Fig. 2
[136]Open in a new tab
a Cell counts per sample in three cell types from each of three
scATAC-seq datasets. The same three cell types in the three
corresponding datasets (Methods) were used, including COVID19X
(n = 39), HealthyDonor (n = 18, middle), and Hematopoiesis (treated as
n = 1, right). b The number of open tiles per sample from MOCHA (light
blue), MACS2 (green), or HOMER (light purple). The same colors are used
in d–h. b, c The two-sided MWW was used to compare results by MOCHA vs
MACS2 (P-value = 4.8e-5, 0.055, 3.6e-11, 0.12, 5.8e-4, 9.9e-4) and
HOMER (P-value = 1.3e-4, 3.0e-2, 1.8e-9, 2.4e-2, 1.32e-4, 3.0e-3),
listed left to right. c UpSet plot showing overlaps between open tiles
from MOCHA, MACS2, or HOMER. For the COVID19X and HealthyDonor
datasets, tiles common to >20% of samples were kept. For the
Hematopoiesis dataset, all tiles were kept. Insert: Violin plot of
signal (i.e., log[2](normalized fragment count+1)) from tiles missed by
MOCHA (i.e., those from MACS2 and/or HOMER but not MOCHA, left l) and
from those unique to MOCHA (i.e., those identified only by MOCHA, right
panel). All P-values = 2.2e-16 (Wilcoxon test). d–f The cumulative
number of detected tiles (d), tiles overlapping with CCCTC-binding
factor (CTCF) sites (e), or tiles overlapping with transcription
starting sites (TSSs, f) as a function of the maximum fraction of
samples without overlapping fragments in the COVID19X (left panel) or
HealthyControl (right panel) datasets. g The number of detected CTCF
sites (top panel) or TSSs (bottom panel) in the Hematopoiesis dataset.
h, The actual (top panel) and the relative (with respect to MOCHA,
bottom panel) runtime to identify open chromatin from single cell data
as a function of the number of samples. The black horizontal line in
the bottom panel marks the MOCHA runtime. CD16 Mono: CD16 monocytes; B
Naive: naive B cells; CD4 CTL: CD4^+ cytotoxic T lymphocytes; CD4 TEM:
CD4^+ effector memory T cells; CD8 TEM: CD8^+ effector memory T cells;
cDC: conventional dendritic cells; CD4 Naive: naive CD4^+ T cells; CD14
Mono: CD14 monocytes. * = 0.01 < P < 0.05, ***P < 0.001. Source data is
provided in “SourceData_Figure2-1.xlsx” and “SourceData_Figure2-2.zip”.
Panels were generated using Adobe Illustrator.
To benchmark performance on sample-specific accessibility, we compared
the number of open regions detected in individual samples
(Fig. [137]2b). On COVID19X, MOCHA detected a median of 129k–195k open
tiles per sample, which was 19–64% higher (significantly with P < 0.05
in 5/6 cases) than the corresponding numbers by MACS2 or HOMER.
Similarly on HealthyDonor, MOCHA detected a median of 117k–216k open
tiles per sample, a 35–59% increase (significantly with P < 0.05 in 5/6
cases) over the corresponding numbers by MACS2 or HOMER. On
Hematopoiesis, MOCHA detected 370k open tiles in cDCs, 665k in naive
CD4^+ T cells, and 1.28 m in CD14 monocytes, which were <7% lower, >43%
higher, and >70% higher, respectively, than the corresponding numbers
by MACS2 or HOMER. Thus MOCHA detected more open chromatin regions than
MACS2 and HOMER in individual samples for almost all cases.
Additionally, we constructed 110 simulated scATAC-seq datasets by
simulating 10 fixed peaksets as the ground truth for open and closed
regions. For each peakset, we sampled 11 cellular abundances (ranging
from 75 to 5000 cells) to mimic cell populations in a typical
scATAC-seq sample. We applied MOCHA, HOMER, and MACS2 algorithms to
identify open regions in the simulated data in the same way as to real
data, without changing any parameters. Across
[MATH: ≈ :MATH]
90% of the simulations, MOCHA had a higher F1 score, and thus more
accurately detected open regions vs noise (Supplementary Fig. [138]8a).
We also conducted simulations by varying the number of fragments per
cell (1k–4.5k) and the total number of open regions (150k–350k,
mimicking three different cell types). The results show that MOCHA
attained the highest F-1 score in 210/240 iterations (87.5%,
Supplementary Fig. [139]8b–d).
To assess the consistency between open tiles detected by the three
tools, we generated TSAMs with a fraction threshold of 20% based on
tiles detected by each tool and compared the corresponding tiles in
TSAMs (Fig. [140]2c). Most tiles were detected by all three tools. As
described above, MOCHA detected more tiles than MACS2 and HOMER in
almost all cases. Among all tiles detected by MACS2 and/or HOMER,
95–97% were also detected by MOCHA in COVID19X, 92–93% in HealthyDonor,
and 84–97% in Hematopoiesis. Tiles detected only by MOCHA contained on
average more fragments than tiles missed by MOCHA (Fig. [141]2c,
inserts; P < 0.001). Furthermore, when we ran linkage disequilibrium
score analysis^[142]46, MOCHA identified 195 enriched annotations,
MACS2 190, HOMER 192, with an overlap of 188 between all three methods
(Supplementary Fig. [143]9). Thus MOCHA not only captured the majority
of tiles detected by MACS2 and HOMER but also added extra tiles of
better signals of potential biological significance than those by MACS2
or HOMER.
To further elucidate differences among tiles detected by the three
tools, we calculated the percentage of zeros among all samples for each
tile in each TSAM and obtained the corresponding cumulant distributions
(Fig. [144]2d). While all three tools agreed on open tiles common to
most samples, MOCHA detected more sample-specific tiles than MACS2 and
HOMER. To test whether these additional tiles potentially contained
biological information, we generated the corresponding cumulative
distributions on tiles mapped to CTCF sites or TSSs and observed
similar patterns (Fig. [145]2e, f). This suggests that the extra tiles
detected by MOCHA may carry important biological information. MOCHA
also detected similar or more open CTCF sites and open TSSs in
Hematopoiesis compared to MACS2 and HOMER (Fig. [146]2g).
Calling peaks on pooled cells of interest is a common practice in
scATAC-seq data analysis^[147]9–[148]11. To compare MOCHA, MACS2 and
HOMER on this approach, we pooled cells of the three cell types in the
three datasets, randomly downsampled the cells to a series of
predetermined cell counts, and applied the three tools to detect
accessible tiles (Supplementary Fig. [149]4c). MOCHA consistently
detected more tiles, more CTCF sites, and more TSSs than MACS2 and
HOMER in almost all cases.
To demonstrate that MOCHA is applicable beyond immune cells and human
data, we benchmarked all three methods on a multi-organ mouse scATAC
atlas^[150]47 (Supplementary Fig. [151]5). We called open tiles on 20
randomly selected and representative organ-cell type populations
ranging from tens to thousands of cells (Supplementary Fig. [152]5a, b)
and compared the number of open tiles, CTCF sites, and TSS detected per
each method (Supplementary Fig. [153]5c–e). MOCHA identified comparable
numbers of open chromatin regions, CTCF sites, and TSSs.
Finally, we benchmarked the runtime for each tool on a cloud computing
environment. MOCHA was consistently faster than HOMER in all tested
cases and MACS2 in all practical cases for sample- and cell-type
specific analyses (Fig. [154]2h). Of note, individual cell populations
in scATAC-seq data range primarily between tens to thousands of cells,
due to technological limitations. Additionally, we estimated run times
by pooling cells across samples (Supplementary Fig [155]4d), and
demonstrated that MOCHA provided fastest runtimes in almost all cases.
MOCHA implements zero-inflated differential accessibility and
co-accessibility analyses
We evaluated MOCHA’s ZI modules against existing state-of-the-art tools
for DAA and CAA. We first benchmarked MOCHA with ArchR and Signac on
DAA. To eliminate differences in open-chromatin/peak-callling by the
three tools and ensure a head-to-head comparison, we restricted the DAA
benchmark to the 215,649 CD16 monocytes tiles identified by MOCHA in
the COVID19X dataset (“Methods” section). We then applied each method
to identify DATs between COVID+ (n = 17) and COVID- (n = 22)
participants using these tiles. MOCHA identified 6211 DATs (false
discovery rate (FDR) < 0.2, Fig. [156]3a, Supplementary Fig. [157]10a).
In comparison, ArchR and Signac detected 6009 and 1266 DATs,
respectively (Fig. [158]3b). While 28% of DATs by MOCHA were in gene
promoter regions, the corresponding percentage was 17% for ArchR and
18% for Signac. As a result, MOCHA, ArchR, and Signac identified 1811,
1006, and 228 genes, respectively, with DATs in their promoter regions.
These genes were enriched (FDR < 0.05) in 27, 1, and 1 Reactome
pathways^[159]48 (Fig. [160]3c), respectively, illustrating a striking
distinction by MOCHA. The same trend was also observed for other
pathway databases (Supplementary Fig. [161]10b). Among the 27 Reactome
pathways revealed by MOCHA (Fig. [162]3d), toll-like receptors (TLRs),
myeloid differentiation primary response 88 (MyD88), interleukins, and
nuclear factor kappa B (NF-κB) pathways all play important roles in
innate immune response to viral infection^[163]49, consistent with the
expected functions of CD16 monocytes. Thus DATs by MOCHA revealed more
biological insights than those by ArchR or Signac.
Fig. 3. Benchmarking MOCHA with ArchR and Signac on differential
accessibility analysis.
[164]Fig. 3
[165]Open in a new tab
a MOCHA’s differential accessible tiles (DATs) in CD16 monocytes
between COVID+ samples during early infection (n = 17) and COVID-
samples (n = 22) in the COVID19X dataset. The volcano plot illustrates
the log[2](FC) on the x-axis against the -log[10](P value) on the
y-axis, where FC represents fold change in accessibility. The
log[2](FC) was estimated using the Hodges-Lehmann estimator^[166]110.
The P value was calculated based on the two-part Wilcoxon test
(two-sided, zero-inflated)^[167]35. DATs with a false discovery rate
(FDR) < 0.2 were considered as significant. b Venn Diagram of MOCHA,
Signac, and ArchR’s DATs. The count and percentage of 1) promoters, 2)
intragenic tiles, and 3) distal tiles are shown for each method and
each Venn diagram subset. c The number of (top) genes with differential
promoters and (bottom) enriched Reactome pathways for each method are
depicted using barplots. d Reactome Pathway enrichment results based
on genes with differential promoter tiles. Pathway categories are
annotated using Reactome’s pathway hierarchy. e Violin plot of Holley’s
|G| from 1000 bootstrapped samples, each containing 50 randomly
selected DATs from each category. Categories with <50 DATs were not
tested. Two-sided Wilcoxon rank-sum test was used. f Leave-one-sample
perturbation analysis to test the robustness of each method in
differential accessibility analysis. New sets of DATs were calculated
iteratively after removing each sample once. The robustness was
assessed by the number of total (solid line) and conserved (dotted
line) DATs detected across perturbations. g Each method’s runtime (in
seconds) as a function of the number of tested tiles. Source data is
provided in “SourceData_Figure3-S10.xlsx”. Panels were generated using
Adobe Illustrator.
To quantify the DAT accuracy for each method, we evaluated its
efficiency in separating COVID19+ and COVID19- samples. We randomly
selected 50 DATs, performed k-means (k = 2) clustering to reflect the
two-group comparison, calculated the absolute value of the G index
(|G|) of agreement^[168]50, and repeated the process 1000 times. DATs
by MOCHA better separate COVID19+ and COVID19- samples than those by
ArchR (P < 0.001) or Signac (P < 0.001, Fig. [169]3e). Similar results
were obtained for N = 25, 50, 75, and 100 randomly selected DATs
(Supplementary Fig [170]12a). It should be noted that this analysis is
aimed for benchmarking, not for routine biological analyses.
Biologically meaningful DATs should be robust against minor changes in
the sample set. Following strategies applied in DESeq2^[171]51, we
assessed the stability of our DAT results across different sample
subsets and benchmarked DAT differences as proxies for evaluating false
positives and false negatives. Starting with DATs from the full sample
set (n = 39), we iteratively removed one sample at a time and
recalculated DATs from the reduced subset of samples (n = 38). We
repeated this process until each sample was removed once. From there,
we collected the set of conserved (e.g., cumulative intersection, proxy
for evaluating false negatives) and inflated (e.g., cumulative
additional, proxy for evaluating false positives) DATs across all
iterations. Ultimately, 3990 (64%) of the 6211 DATs by MOCHA were
conserved (Fig. [172]3f), which was 1.8–3.4 times higher than the
corresponding rate of ArchR (1136/6009, 20%) or Signac (458/1264, 36%).
MOCHA had 5782 (93%) additional DATs, which was 2.7 times lower than
the corresponding rate by ArchR (15310/6009, 255%) and 1.6 times higher
than that by Signac (735/1264, 58%). MOCHA was more robust than ArchR
and had a split performance in comparison with Signac in detecting DATs
regardless of sample set. However, MOCHA had 3990 conserved DATs, 8.7
times higher than those of Signac (458). MOCHA provided a better
balance between sensitivity and robustness in detecting DATs compared
to ArchR and Signac.
When benchmarking runtime, we observed that MOCHA and ArchR took 1.8
and 1.6 min, respectively, to evaluate approximately 200,000 tiles,
while Signac took 18.6 h (Fig. [173]3g). MOCHA was 23–614x faster than
Signac and 0.86–2.8x as fast as ArchR.
In addition, we benchmarked MOCHA against bulk differential methods
from RNA-seq (DESeq2) and ChIP-seq (DiffChIPL). MOCHA performed
significantly better at classification tests (p < 0.001, Supplementary
Figs. [174]11 and [175]12), as well as in identifying enriched
pathways. Additionally, we assessed false positive rates (FPRs) by
shuffling the disease labels (Supplementary Fig [176]12b). MOCHA had a
0% FPR across all 50 random permutations. In comparison, the minimum,
median, and maximum FPR for other methods were: DESeq2 (0%, 0%, 3%),
DiffChipL (0%, .05%, 11.1%), Signac (0%, 0.43%, 13.5%), and ArchR
(0.11%, 1.0%. 41.6%). We also assessed recalls by downsampling the
number of subjects (Supplementary Fig [177]12c). Signac obtained higher
recalls with its very conservative (thus very few) DAT calls, while the
remaining 4 methods obtained similar recalls. We note that the recall
evaluated in this approach was likely a lower bound due to
disease/human heterogeneity and sample size reduction, a loss of > 20%
of power when the sample size was reduced from n = 39 to n = 30 based
on 2-sample t-test (Supplementary Fig [178]12d). Overall, MOCHA
provided comparable performance on recall, and better performance on
FPR than other methods.
Next, we compared the ZI and the standard Spearman correlations across
cell types and samples in the COVID19X dataset based on
log[2](λ^(1) + 1) in TSAMs. Guided by prior studies^[179]52,[180]53, we
leveraged published HiC interaction data to enrich for correlated
regions, and used this to benchmark the two correlation methods using
HiC (Methods). For the inter-cell type co-accessibility, we used known
promoter-enhancer interactions in naive CD4^+ and CD8^+ T cells^[181]54
to define possibly interacting tile pairs (1.21 million) while randomly
selecting 100k tile pairs as a negative background for comparison
(Methods). Both correlation approaches largely generated similar
results (Supplementary Fig. [182]13a, Spearman correlation = 0.687,
P = 2.2 × 10^−16), but disagreements were also observed (Supplementary
Fig. [183]13b). The ZI correlation better distinguished
promoter-enhancer pairs from the random pairs than the standard
correlation (Kolmogorov–Smirnov (KS) test statistic = 0.26 vs. 0.13),
and identified 1000x more promoter-enhancers pairs (15,988 vs. 149,
FDR < 0.1, Supplementary Fig. [184]13c, d). For the inter-sample
co-accessibility, we first collected a subset of tiles in the TSAM of
CD16 monocytes that roughly located in the first million bp of
chromosome 4 (chr4:121500-1130999) and then used both correlation
approaches to calculate the inter-sample correlations between all pairs
(about 34.5k) of these tiles. While both correlation approaches
generally agreed with each other with a rank correlation of 0.69
(P < 0.001, Supplementary Fig. [185]13e), 9550/34,596 (28%) of the
tested correlations switched sign between the two approaches and
2087/34,596 (6%) of them differed in value by >0.25 (Supplementary
Fig. [186]13f). Thus properly accounting for ZI is essential for
reliable CAA in sparse scATAC-seq data.
Networks of alternatively regulated genes in early SARS-CoV-2 infection
To demonstrate how improvements in MOCHA can be leveraged for
constructing gene regulatory networks, we investigated possible
alternative TSS regulation by CD16 monocytes during early SARS-CoV-2
infection (Methods), using the COVID19X dataset. We observed two types
of alternatively regulated genes (ARGs, Fig. [187]4a). Type I (n = 278
genes, Fig. [188]4b) had at least one TSS showing differential
accessibility (FDR < 0.2) between COVID19+ and COVID19- samples, while
other open TSSs had no change. Type II included 5 genes with at least
two differential TSSs in opposing directions (ATP1A1, UBAP2L, YWHAZ,
CAPN1, and ARHGAP9; Fig. [189]4c). Interestingly, all Type II ARGs were
previously associated with COVID-19 and two (ATP1A1 and CAPN1) have
been proposed as therapeutic targets for COVID-19 (Supplementary
Table [190]2). Pathway enrichment analysis^[191]55 on all Type I and
Type II ARGs revealed that they were enriched in pathways of the innate
immune response to infection, including MyD88 and TLR responses
(Fig. [192]4d). The same analysis based on DATs from Signac and ArchR
identified a total of 483 and 45 ARGS (type I and type II),
respectively (Supplementary Fig. [193]14). However, neither of these
gene sets were enriched for any immune pathways, suggesting their
limited utility in such analyses.
Fig. 4. Regulatory network construction on alternative transcription starting
sites in CD16 monocytes during early COVID-19 infection.
[194]Fig. 4
[195]Open in a new tab
a Scatter plot of differential accessibility at potential alternative
transcription starting sites (TSSs). False discovery rate (FDR) and
fold change (FC) were evaluated on chromatin accessibility in CD16
monocytes between COVID+ samples during early infection (n = 17) and
COVID- samples (n = 22) in the COVID19X dataset. All pairwise
combinations of -log[10](FDR)*sign(log[2](FC)) are shown. TSS pairs
were categorized as type I if only one TSS was significantly
differential (FDR < 0.2), or type II if both were significantly
differential but in the opposite directions. Pairs of TSSs that were
significantly differential in the same direction were not considered.
b, c Coverage tracks illustrating type I (b) or type II (c) alternative
TSS regulation around exemplar genes (* denotes a significant
differential accessibility tile (DAT), FDR < 0.2). Precise FDR values
are in the source data file. d Reactome pathway enrichment for genes
with alternatively regulated TSSs (both type I and II). Pathway
annotation was based on Reactome’s hierarchical database. e Motif
enrichment using DATs involved in alternatively regulated TSSs and
their co-accessible tiles (within ±1 M bp, zero-inflated correlation
>0.5). f NicheNet-based ligand-motif set enrichment analysis (LMSEA) on
motifs with FDR < 0.01. g A network centered around CEBPA that was
constructed using significant ligands, motifs, and genes with
alternatively regulated TSS sites. Ligand-motif links represent
NicheNet-based associations. Motif-gene links represent motif presence
in either an alternative TSS tile, or tiles correlated to an
alternative TSS. TF transcription factor; Source data is provided in
“SourceData_Figure4.xlsx”. Panels were generated using Adobe
Illustrator.
To understand the upstream signaling mechanisms regulating ARGs, we
first applied MOCHA to identify potential, distal regulatory tiles that
were co-accessible (inter-sample, within 1 Mbp) with the corresponding
DATs. Motif enrichment analysis on DATs of ARGs and their co-accessible
tiles identified 13 enriched motifs, including activator protein-1
(AP-1) family motifs, PATZ1, and CEBPA (FDR < 0.05, Fig. [196]4e).
Next, we carried out ligand-motif set enrichment analysis (LMSEA,
Methods) based on a priori ligand-motif (transcription factor (TF))
interactions in NicheNet^[197]56. We identified 122 significantly
enriched ligands (FDR < 0.05, Fig. [198]4f), many of which, including
IL17, IL21 and PLAU, have already been implicated in COVID-19 (Source
Data Fig. [199]4g). Finally, we constructed a network that linked
ligands, TFs, and ARGs together (Methods). Notably, the subnetwork of
CEBPA is particularly interesting (Fig. [200]4g): CEBPA was proposed as
a COVID-19 therapeutic target^[201]57 and identified as a key regulator
in CD14 monocytes of hospitalized COVID-19 patients from scATAC-seq
data^[202]6. Furthermore, 29/30 of its upstream ligands were either
therapeutic targets or altered during SARS-CoV-2 infection, and 20/27
of its downstream ARGs were associated to COVID-19 or viral
infection^[203]57–[204]64. Two ARGs, SOCS3/SOCS3-DT and CAPN1, were
potential targets for COVID-19 treatment^[205]65. Using MOCHA’s
differential accessibility and co-accessibility modules, we constructed
a putative upstream regulatory network that could be driving
alternative TSS regulation in CD16 monocytes during early SARS-CoV-2
infection. Given that these results are largely aligned with the
literature, we anticipate that this approach can be used more generally
to identify potentially novel biological mechanisms.
Longitudinal analysis of chromatin accessibility during COVID-19 recovery
To understand chromatin regulation during COVID-19 recovery, we
analyzed scATAC-seq data of CD16 monocytes from our longitudinal
COVID-19 study (Fig. [206]5a). The dataset, denoted as COVID19L, was
collected on 69 longitudinal PBMC samples from 18 COVID+ participants
(10 females and 8 males) over a period of 1–121 days PSO (Methods). We
integrated MOCHA with existing tools and developed customized
approaches to analyze the data at both single-cell and pseudo-bulk
levels.
Fig. 5. Integrative analyses to reveal longitudinal dynamics in CD16
monocytes during COVID-19 recovery.
[207]Fig. 5
[208]Open in a new tab
a Longitudinal COVID19 cohort overview (n = 18). Time points indicated
by black dots illustrate sample availability for each COVID+
participant. b Single-cell Density UMAPs using open regions in CD16
monocytes from MOCHA (full COVID19 dataset, n = 91 samples), showing
cells from COVID+ samples during early infection (1–15 days PSO,
n = 21), late infection (16–30 days PSO, n = 13), and recovery (>30
days PSO, n = 35), and COVID- samples (n = 22). c Volcano plot
illustrating the −log[10](FDR) vs. the slope of motif usage over time.
Motif usage quantified using Z-scores from ChromVAR run on the TSAM.
Insert: The network showing interacting TFs within the AP-1 family
(APID database^[209]118). TFs were color-coded by the sign of their
slope. d Longitudinal motif usage over time for an exemplary set of
TFs. Individual participants and population trends are shown with thin
lines (colored-coded by subject), and thick black lines (population
trend). e Volcano plot illustrating the −log10(FDR) vs. the slope of
promoter accessibility over time based on (ZI-GLMM). A subset of the
top promoters are labeled with their corresponding genes. f Significant
Reactome pathways (FDR < 0.1) enriched for genes with significant
promoter accessibility changes. The pathways were aggregated into
upper-level pathway annotations using Reactome’s database hierarchy.
The barplot shows the number of pathways in each category. The pie
chart breaks down the immune system pathways by Reactome’s next level
categories. g Three scatter plots illustrating examples of associations
between promoter accessibility (y-axis) and JUNB’s ChromVar z-score
(x-axis). The thick black line shows the population trend from ZI-GLMM.
h Bipartite network illustrating associations between the top 5 motifs
(largest positive (+) or negative (−) slope) and 14 significant innate
immune pathways. Motif-promoter associations only shown if Motif is
related to >33% of significant genes in a pathway. c–h Data from CD16
monocytes in the COVID19L dataset (n = 69). PSO post symptom onset,
UMAP Uniform Manifold Approximation and Projection, TSAM tile-sample
accessibility matrix, FDR false discovery rate, TF transcription
factor, AP-1 activator protein-1, ZI-GLMM zero-inflated generalized
linear mixed model. Source data is provided in
“SourceData_Figure5.xlsx”, including precise FDR values. Panels were
generated using Adobe Illustrator.
First, open tiles from the TSAM of CD16 monocytes were imported into
ArchR as a peakset for dimensionality reduction. The resulting Uniform
Manifold Approximation and Projection^[210]66 (UMAP) plot showed a
clear shift in cellular population from initial infection to 30+ days
post infection, at which time the cellular population appeared similar
but not identical to that of the 22 COVID- participants (Fig. [211]5b).
Second, days PSO were binned (Fig. [212]5b) and used to learn a
Monocle3^[213]67 trajectory, which largely followed the cellular
population shift across the UMAP space (Supplementary Fig. [214]15a).
Using ArchR, we further identified genes with GeneScore changes or
tiles from the TSAM with accessibility shifts along this trajectory
(Supplementary Fig. [215]15b). The genes with promoter accessibility
shifts in CD16 monocytes were enriched in 72 immune system pathways,
including 39 innate immune, 15 adaptive immune, and 18 cytokine
signaling pathways (Supplementary Fig. [216]15c, right panel). In
comparison, the corresponding pathway counts from the GeneScore
analysis was only 10, 2, 4, and 4, respectively (Supplementary
Fig. [217]15c, left panel). TSAM-based results were more informative
and aligned better with the expected roles of CD16 monocytes than
GeneScore-based ones.
Third, we converted the TSAM of CD16 monocytes into a chromVAR object
and calculated sample-specific z-scores for TF activity (Methods). This
enabled us to apply generalized linear mixed models (GLMM) to identify
TFs with dynamic activities in CD16 monocytes during COVID-19 recovery,
adjusting for age and sex in the models. Among the 223 TFs whose
activity changed significantly in time (FDR < 0.1, Fig. [218]5c, d),
the AP-1 family (such as ATF1-7, JUN/B/D, MAF/F/G/K, FOS/B, and
BACH1-2; Fig. [219]5c, insert) and the NF-κB family (such as REL/A and
NFKB1-2) mostly had decreased activities, in consistency with their
inflammatory, infection-responsive functions. On the contrary, the
forkhead box (FOX) TF family (such as FOXP1/4, FOXG1, FOXO1, and FOXK2)
had increased activities, which agrees with their known roles in immune
homeostasis^[220]68–[221]70. In comparison, we also identified 86 TFs
with chromVAR z-score changes along the pseudotime trajectory described
above, among which only 31 were unique (Supplementary Fig. [222]16).
Longitudinal analysis based on real time identified more TFs with
dynamic activities during COVID-19 recovery than the trajectory
analysis based on pseudotime.
Fourth, the TSAM of CD16 monocytes was used to examine how gene
promoter accessibility shifted during COVID-19 recovery. The data had
about 20% zeros, which can be associated with either technical dropouts
or biological features. We applied ZI-GLMM to model both types of
zeroes (Methods, Supplementary Fig. [223]18). Given that only 0.193% of
highly accessible tiles (which are defined as having at least 80%
non-zero values within any infection stage and include most promoters)
had variance predominately explained by batch, we did not include batch
as a covariate (Methods, Source Data File: Variance Decomposition). A
total of 2,120 genes demonstrated promoter accessibility shifts over
time (FDR < 0.1, adjusting for age and sex), including genes regulating
immune inflammation such as NFKBIE and DOK3 (Fig. [224]5e and
Supplementary Fig. [225]17a)^[226]71–[227]73. This gene set was
enriched for 71 Reactome pathways (FDR < 0.1; Fig. [228]5f).
Interestingly, among the 23 immune system pathways, five involve
signaling by interleukins and 18 are related to the innate immune
system (such as TLR-, MyD88-, and IRAK1-related pathways), but none are
specific to the adaptive immune system. Again, these results are
consistent with the expected roles of CD16 monocytes during viral
infection.
Finally, to discover possible TF-gene regulations in CD16 monocytes
during COVID-19 recovery, we applied ZI-GLMM to examine whether TFs
with significant activity changes were associated with genes with
significant promoter accessibility changes (Methods). For example, we
found that JUNB chromVAR z-score was significantly associated
(P < 0.05) with the promoter accessibility of 19 genes in the TLR4
cascade pathway (Fig. [229]5g and Supplementary Fig. [230]17b). As an
illustration, we selected the top five TFs having the largest positive
(PLAGL1, ELF5, ETV6, SPIB, and SPIC) or negative (JUN, FOSL1, SMARRC1,
FOSL2 and JUNB) slopes, respectively, against time and examined their
associations with gene promoters within the 18 innate immune pathways
enriched during COVID-19 recovery (Fig. [231]5f). Nine of the 10 TFs
had high to medium expression in CD16 monocytes while the exception
SPIB had a very low expression in CD16 monocytes but high expression in
their progeny, macrophages, based on published
datasets^[232]74–[233]77. The five TFs having negative slopes were
significantly associated (P < 0.05) with 14 pathways (Fig. [234]5h),
consistent with the prominent roles of AP-1 TF family in innate
immunity. Among the five TFs with positive slopes, PLAGL1 and ETV6 were
significantly associated (p < 0.05) with five innate immune pathways,
SPIB and SPIC with 2, and ELF1 with 1, respectively. Four innate immune
pathways did not show significant associations with any of these top 10
TFs. While wet-lab validation is needed, such TF-gene associations
generate interesting biological hypotheses regarding gene expression
regulation in CD16 monocytes during COVID-19 recovery and can be
expanded to include enhancers and additional regulatory elements. To
the best of our knowledge, ZI mixed modeling is currently not possible
in other existing scATAC-seq data analysis packages. Together, these
results demonstrate MOCHA is a valuable tool in studying chromatin
dynamics and gene regulatory networks based on longitudinal scATAC-seq
data.
Discussion
We developed MOCHA to robustly infer active gene regulatory programs in
human disease cohorts based on scATAC-seq. First, we showed that our
open chromatin model significantly agreed and was able to detect more
sample-specific open chromatin regions than MACS2 and HOMER. Second, we
identified differential accessible regions that better distinguish
between COVID+ and COVID- participants than those by ArchR and/or
Signac and uniquely revealed pathways affected by SARS-CoV-2 infection.
Third, we constructed ligand-TF-gene networks on potential alternative
TSS regulations during SARS-CoV-2 infection using only scATAC-seq data.
Fourth, using ZI mixed models, we identified motifs and promoters that
were associated with COVID-19 recovery and constructed a TF-pathway
network to infer which pathways were functionally important during
COVID-19 recovery. MOCHA substantially increased the value of
scATAC-seq in our distinct disease cohorts (including COVID19) by
enabling robust modeling and visibility into the functional
implications of chromatin accessibility. In addition, we illustrated
how MOCHA can be integrated with existing tools such as ArchR,
Monocle3, chromVAR, and NicheNet while enabling customized analysis
using ZI-mixed effects models to gain unique insights from scATAC-seq
data. Furthermore, we demonstrated that MOCHA can also be applied to
data from a range of human and mouse tissues. The LRMs of MOCHA can be
retrained if a different tile size (other than 500 bp) is preferred or
when data is collected on non-diploid cells. Vignettes for analyzing
patterns of dropout across cell types are also provided. Given its
capabilities, we believe MOCHA is a valuable addition for analyzing
scATAC-seq data, especially in biomedical research.
Constructing robust regulatory networks begins with reliable
identification of patient- and cell-type-specific open chromatin. We
used peaks called by MACS2 on pseudo-bulk ATAC-seq data as imperfect
“ground truth” to train and validate MOCHA, and constructed LRMs to
identify sample-specific accessibility using statistically informative
single-cell and pseudobulk features, resulting in λ^(1) and λ^(2).
While other features were tested, they were not statistically
informative. The training data of NK cells (n = 179,836, 750 million
fragments) had enough sequencing depth for reliable MACS2 performance
and thus likely reliable MOCHA training. However, MACS2 might call
every fragment as a peak for less abundant cell types, leading to many
false positives. To mitigate such artifacts, some pipelines
artificially limit the number of peaks called by MACS2^[235]11. To
provide a reasonable comparison, we focused our benchmarking on cell
types with moderate to high cell counts. MOCHA outperformed MACS2 in
calling sample-specific regions despite relying on MACS2 for training.
In theory, MOCHA was not designed to call open tiles on datasets of
mixed cells from multiple studies. For example, we used a global
prefactor
[MATH: S :MATH]
to account for differences in data quality instead of, more properly,
estimating an
[MATH: S :MATH]
for each of the many studies within the Hematopoiesis dataset.
Nevertheless, MOCHA outperformed MACS2 and HOMER on all three datasets
of varying data quality, although only slightly on the Hematopoiesis
dataset. In simulation studies, we benchmarked the three methods across
a broader set of conditions (cell abundance, sequencing depth, and
total open regions). MOCHA outperformed MACS2 and HOMER in about 90% of
all conditions. It is possible that the LRMs in MOCHA may need to be
retrained if the difference in species, sample type, experimental
protocol, sequencing depth, data quality, etc., becomes overwhelmingly
large between our training data and user data. Due to a lack of access
to GPU hardware^[236]78 and integration challenges^[237]13, we
benchmarked MOCHA only with MACS2 and HOMER, which are the most widely
incorporated open source peak callers. Given its strong performance
against these standard algorithms, MOCHA’s robust open chromatin
results provided a solid foundation for downstream analysis. We note
that some biological questions (e.g., copy number variation^[238]79,
causal variants^[239]80, nucleosome positioning^[240]81, and the role
of pioneer factors^[241]82) may not be best answered with tools such as
MOCHA calling broad peaks.
Additionally, gene regulatory networks require clear identification of
accessibility changes. However, the presence of drop-out leads to many
unreliable results. We notice that there has been a debate on whether
scRNA-seq data is ZI^[242]27. That discussion centers around
measurement vs expression models^[243]27. scATAC-seq data is even
sparser than scRNA-seq data with a near binary (0 or 1) measurement of
accessibility^[244]1, and a detection rate of 1– 10% on expected
accessible peaks at the single-cell level (compared to 10–45% of
expressed genes being detected per cell in scRNA-seq)^[245]8. The
incorporation of ZI statistical methods to handle drop out is a major
advantage of MOCHA over most existing tools. ZI methods provide
well-documented improvements over their counterparts on ZI
data^[246]83,[247]84. While ZI methods are applied to scRNA-seq data
exclusively at single-cell level, the sparsity of scATAC-seq data makes
it necessary to apply ZI methods even at pseudo-bulk level in most
analyses. Our analyses showed that excess zeros in pseudobulk
scATAC-seq data were not fully captured by negative binomial
distributions, motivating the use of ZI statistics. We applied the
two-part Wilcoxon model^[248]34 for DAA, ZI correlation^[249]36 for
CAA, and ZI-GLMM^[250]33 for longitudinal modeling in MOCHA and
demonstrated how MOCHA led to more informative results than existing
tools. While ZI-GLMM approaches model biological zeroes and technical
drop-outs separately, more research is needed to fully disentangle
them. No ZI method is needed for open chromatin identification since
only tiles that contain fragments are evaluated and we don’t impute
accessibility for empty tiles. It is noted that, while many bulk
ATAC-seq or bulk RNA-seq approaches (e.g., DiffChIPL, DESeq2, EdgeR,
Limma, and more)^[251]51,[252]85–[253]91 also utilize generalized
linear models; they do not account for ZI. Recent literature has been
moving towards analyzing scATAC-seq data implementing ZI
approaches^[254]92–[255]95.
While single-cell analysis provides granularity into cellular behavior,
human cohort studies are usually interested in identifying
patient-level behavior across cell populations. While current methods
are centered at the single cell level, MOCHA aggregates scATAC-seq data
into TSAMs to facilitate sample-centric analysis. To the best of our
knowledge, while pseudo bulking is fairly common in analyzing scRNA-seq
data and inspired our use of it in MOCHA, this rather simple approach
has not been reported to analyze scATAC-seq data. The approach provides
several important advantages. First, as literature in the scRNA-seq
spaces has shown, the approach specifically addresses
pseudo-replication bias in single-cell data and avoids computationally
expensive single-cell mixed effect models, following recent advice for
analyzing scRNA-seq data^[256]29,[257]30. Second, the sample-centric
approach makes it computationally feasible to analyze large, diverse
human cohorts and explicitly models patient-level heterogeneity. Third,
since the TSAM is constructed from standard Bioconductor data
structures, its flexibility enables a broad range of scientific
enquiries into gene and chromatin regulation and supports seamless
integration with a variety of bioinformatics tools. For example, we
applied ZI-GLMM and chromVAR to study COVID-19 recovery on our
longitudinal scATAC-seq data. We believe TSAMs facilitate the
extraction of genomic insights from large-scale, heterogeneous
scATAC-seq data. Nevertheless, the approach is underpowered for studies
of small sample size and not appropriate for comparing a handful of
samples. We plan to adapt MOCHA for small-scale studies in the future.
We selected CD16 monocytes in our COVID19 dataset to showcase the
utility of MOCHA in biomedical research. Our results reveal from
multiple perspectives that the genomic regions associated with innate
immune pathways (such as TLR, MyD88, and NF-κB) played essential roles
in SARS-CoV-2 infection and patient recovery, aligning with the
expected functions of CD16 monocytes during viral infection^[258]96. To
the best of our knowledge, explicit longitudinal analysis on scATAC-seq
data has not been reported, limiting the value of scATAC-seq in
studying the regulatory landscapes of disease progression and recovery.
Furthermore, despite the large number of publications on COVID-19,
alternative TSS regulation during SARS-CoV-2 infection has not been
reported. We consider MOCHA as a tool to generate interesting
hypotheses from scATAC-seq data, which nevertheless need to be
validated in follow-up studies. An in-depth, comprehensive analysis of
our COVID19 cohort is beyond the scope of current work and will be
presented in a follow-up paper.
In short, we present MOCHA as a tool to better infer gene regulation
from scATAC-seq in biomedical and biological research. MOCHA is freely
available as an R package in [259]CRAN.
Methods
Inclusion and ethical considerations
All study activities were approved by institutional review boards at
the participating institutions where required. Informed consent was
obtained from all participants at the Seattle Vaccine Trials Unit to
participate in the study and to publish their corresponding research
data. Two participants declined to publish their raw sequencing data.
All human data are anonymized, and the human cohort data used in this
manuscript had either prior IRB approval or were publicly available.
Longitudinal COVID-19 cohort
We recruited in the greater Seattle area n = 18 participants (10
females and 8 males, aged 22–79 years) who tested positive (COVID+) for
SARS-CoV-2 virus (Wuhan strain) and n = 23 uninfected (COVID-)
participants (10 females and 13 males, aged 29–77 years) for our
longitudinal COVID-19 study^[260]38, “Seattle COVID-19 Cohort Study to
Evaluate Immune Responses in Persons at Risk and with SARS-CoV-2
Infection”. Due to recruitment during the pandemic, there is a bias
towards first responders. All COVID+ participants had mild to moderate
symptoms. Peripheral blood mononuclear cell (PBMC) and serum samples
were collected from the COVID- participants at a single time point and
from the COVID+ participants at 3–5 time points over a period of 1–121
days post-symptom-onset (PSO, total samples n = 70). Study data were
collected and managed using REDCap electronic data capture tools hosted
at Fred Hutchinson Cancer Research Center (FHCRC). The FHCRC
Institutional Review Board (IRB) approved the studies and procedures.
Informed consent was obtained from all participants at the Seattle
Vaccine Trials Unit to participate in the study and to publish their
corresponding research data. Two participants declined to publish their
raw sequencing data. Sex of participants was determined based on
self-reporting. Participants were not compensated for being in this
study.
COVID19 single-cell ATAC-seq
PBMC isolation
Blood collected in acid citrate dextrose tubes was transferred to
Leucosep tubes (Greiner Bio One). The tube was centrifuged at 800–1000x
g for 15 min and the PBMC layer recovered above the frit. PBMCs were
washed twice with Hanks Balanced Solution without Ca+ or Mg+ (Gibco) at
200–400 × g for 10 min, counted, and aliquoted in heat-inactivated
fetal bovine serum with 10% dimethylsulfoxide (DMSO, Sigma) for
cryopreservation. PBMCs were cryopreserved at −80 °C in Stratacooler
(Nalgene) and transferred to liquid nitrogen for long-term storage.
FACS neutrophil depletion
To remove dead cells, debris, and neutrophils prior to scATAC-seq, PBMC
samples were sorted by fluorescence-activated cell sorting (FACS) prior
to cell permeabilization as described previously^[261]97. Cells were
incubated with Fixable Viability Stain 510 (BD, 564406) for 15 min at
room temperature and washed with AIM V medium (Gibco, 12055091) plus
25 mM HEPES before incubating with TruStain FcX (BioLegend, 422302) for
5 min on ice, followed by staining with mouse anti-human CD45 FITC
(BioLegend, 304038, 1.0 uL/10e6 cells) and mouse anti-human CD15 PE
(BD, 562371, 1.0 uL/10e6 cells) antibodies for 20 min on ice. Cells
were washed with AIM V medium plus 25 mM HEPES and sorted on a BD
FACSAria Fusion. A standard viable CD45+ cell gating scheme was
employed: FSC-A x SSC-A (to exclude sub-cellular debris), two FSC-A
doublet exclusion gates (FSC-W followed by FSC-H), dead cell exclusion
gate (BV510 LIVE/DEAD negative), followed by CD45+ inclusion gate.
Neutrophils (defined as SSC high, CD15+) were then excluded in the
final sort gate. An aliquot of each post-sort population was used to
collect 50,000 events to assess post-sort purity.
Sample preparation
Permeabilized-cell scATAC-seq was performed as described
previously^[262]97. A 5% w/v digitonin stock was prepared by diluting
powdered digitonin (MP Biomedicals, 0215948082) in DMSO (Fisher
Scientific, [263]D12345), which was stored in 20 μL aliquots at −20 °C
until use. To permeabilize, 1 × 10^6 cells were added to a 1.5 mL low
binding tube (Eppendorf, 022431021) and centrifuged (400 × g for 5 min
at 4 °C) using a swinging bucket rotor (Beckman Coulter Avanti J-15RIVD
with JS4.750 swinging bucket, [264]B99516). Cells were resuspended in
100 μL cold isotonic Permeabilization Buffer (20 mM Tris-HCl pH 7.4,
150 mM NaCl, 3 mM MgCl2, 0.01% digitonin) by pipette-mixing 10 times,
then incubated on ice for 5 min, after which they were diluted with
1 mL of isotonic Wash Buffer (20 mM Tris-HCl pH 7.4, 150 mM NaCl, 3 mM
MgCl2) by pipette-mixing five times. Cells were centrifuged (400 × g
for 5 min at 4 °C) using a swinging bucket rotor, and the supernatant
was slowly removed using a vacuum aspirator pipette. Cells were
resuspended in chilled TD1 buffer (Illumina, 15027866) by
pipette-mixing to a target concentration of 2300–10,000 cells per μL.
Cells were filtered through 35 μm Falcon Cell Strainers (Corning,
352235) before counting on a Cellometer Spectrum Cell Counter
(Nexcelom) using ViaStain acridine orange/propidium iodide solution
(Nexcelom, C52-0106-5).
Tagmentation and fragment capture
scATAC-seq libraries were prepared according to the Chromium Single
Cell ATAC v1.1 Reagent Kits User Guide ([265]CG000209 Rev B) with
several modifications. 15,000 cells were loaded into each tagmentation
reaction. Permeabilized cells were brought to a volume of 9 μl in TD1
buffer (Illumina, 15027866) and mixed with 6 μl of Illumina TDE1 Tn5
transposase (Illumina, 15027916). Transposition was performed by
incubating the prepared reactions on a C1000 Touch thermal cycler with
96– Deep Well Reaction Module (Bio-Rad, 1851197) at 37 °C for 60 min,
followed by a brief hold at 4 °C. A Chromium NextGEM Chip H (10x
Genomics, 2000180) was placed in a Chromium Next GEM Secondary Holder
(10x Genomics, 3000332) and 50% Glycerol (Teknova, G1798) was dispensed
into all unused wells. A master mix composed of Barcoding Reagent B
(10x Genomics, 2000194), Reducing Agent B (10x Genomics, 2000087), and
Barcoding Enzyme (10x Genomics, 2000125) was then added to each sample
well, pipette-mixed, and loaded into row 1 of the chip. Chromium Single
Cell ATAC Gel Beads v1.1 (10x Genomics, 2000210) were vortexed for 30 s
and loaded into row 2 of the chip, along with Partitioning Oil (10x
Genomics, 2000190) in row 3. A 10x Gasket (10x Genomics, 370017) was
placed over the chip and attached to the Secondary Holder. The chip was
loaded into a Chromium Single Cell Controller instrument (10x Genomics,
120270) for GEM generation. At the completion of the run, GEMs were
collected and linear amplification was performed on a C1000 Touch
thermal cycler with 96–Deep Well Reaction Module: 72 °C for 5 min,
98 °C for 30 s, 12 cycles of 98 °C for 10 s, 59 °C for 30 s and 72 °C
for 1 min.
Sequencing library preparation
GEMs were separated into a biphasic mixture through addition of
Recovery Agent (10x Genomics, 220016); the aqueous phase was retained
and removed of barcoding reagents using Dynabead MyOne SILANE (10x
Genomics, 2000048) and SPRIselect reagent (Beckman Coulter,
[266]B23318) bead clean-ups. Sequencing libraries were constructed by
amplifying the barcoded ATAC fragments in a sample indexing PCR
consisting of SI-PCR Primer B (10x Genomics, 2000128), Amp Mix (10x
Genomics, 2000047), and Chromium i7 Sample Index Plate N, Set A (10x
Genomics, 3000262) as described in the 10x scATAC User Guide.
Amplification was performed in a C1000 Touch thermal cycler with
96–Deep Well Reaction Module: 98 °C for 45 s, for 9 to 11 cycles of:
98 °C for 20 s, 67 °C for 30 s, 72 °C for 20 s, with a final extension
of 72 °C for 1 min. Final libraries were prepared using a dual-sided
SPRIselect size-selection cleanup. SPRIselect beads were mixed with
completed PCR reactions at a ratio of 0.4x bead:sample and incubated at
room temperature to bind large DNA fragments. Reactions were incubated
on a magnet, and the supernatant was then transferred and mixed with
additional SPRIselect reagent to a final ratio of 1.2x bead:sample
(ratio includes first SPRI addition) and incubated at room temperature
to bind ATAC fragments. Reactions were incubated on a magnet, the
supernatant containing unbound PCR primers and reagents was discarded,
and DNA-bound SPRI beads were washed twice with 80% v/v ethanol. SPRI
beads were resuspended in Buffer EB (Qiagen, 1014609), incubated on a
magnet, and the supernatant was transferred resulting in final,
sequencing-ready libraries.
Quantification and sequencing
Final libraries were quantified using a Quant-iT PicoGreen dsDNA Assay
Kit (Thermo Fisher Scientific, P7589) on a SpectraMax iD3 (Molecular
Devices). Library quality and average fragment size were assessed using
a Bioanalyzer (Agilent, G2939A) High Sensitivity DNA chip (Agilent,
5067-4626). Libraries were sequenced on the Illumina NovaSeq platform
with the following read lengths: 51nt read 1, 8nt i7 index, 16nt i5
index, 51nt read 2.
Data preprocessing
scATAC-seq libraries were processed as described previously^[267]97. In
brief, cellranger-atac mkfastq (10x Genomics v1.1.0) was used to
demultiplex BCL files to FASTQ. FASTQ files were aligned to the human
genome (10x Genomics refdata-cellranger-atac-GRCh38-1.1.0) using
cellranger-atac count (10x Genomics v1.1.0) with default settings.
Fragment positions were used to quantify reads overlapping a reference
peak set (GSE123577_pbmc_peaks.bed.gz from GEO accession
[268]GSE123577^[269]98), which was converted from hg19 to hg38 using
the liftOver package for R^[270]99, ENCODE reference accessible regions
(ENCODE file ID ENCFF503GCK^[271]100), and TSS regions (TSS ± 2 kb from
Ensembl v93^[272]101 for each cell barcode using a bedtools
(v2.29.1^[273]102) analytical pipeline.
Quality control
Custom R scripts were used to remove cells with less than 1,000
uniquely aligned fragments, less than 20% of fragments overlapping
reference peak regions, less than 20% of fragments overlapping ENCODE
TSS regions, and less than 50% of peaks overlapping ENCODE reference
regions. The ArchR package^[274]11 was used to assess doublets in
scATAC data. Doublets were identified using the ScoreDoublets function
using a filter ratio of 8, and cells with a Doublet Enrichment score
exceeding 1.3 as determined by ArchR’s doublet detection
algorithm^[275]11 were not considered for downstream analysis.
Dimensionality reduction and cell type labeling
We used the ArchR package to generate a count matrix for a PBMC
reference peak set^[276]98. Dimensionality reduction was performed
using the ArchR addIterativeLSI function (parameters varFeatures =
10,000, iterations = 2), and the addClusters function was used to
identify clusters in latent semantic indexing (LSI) dimensions using
the Louvain community detection algorithm. For visualization, Uniform
Manifold Approximation and Projection^[277]66 (UMAP) was performed
using ArchR’s addUMAP function at the default settings. The ArchR
addGeneIntegrationMatrix function (parameters transferParams =
list(dims = 1:10, k.weight = 20)) was used to label scATAC cells using
the Seurat level 1 cell types from the Seurat v4.0 PBMC reference
dataset^[278]103. To generate clusters that more closely matched label
transfer results, we performed K-means clustering on the UMAP
coordinates using 3 to 50 cluster centers and identified a set of
clusters that each had > 80% of cells sharing a single cell type
identity. Almost all such clusters contained ≥98% cells from a single
major cell type (T cells, B cells, NK cells, or monocytes/DCs/other),
with the exception of a single cluster with 88% purity. We used
clusters of the same major cell type to subset the data into T cells, B
cells, NK cells, or monocytes/DCs/other for downstream analyses. For
each major cell type, we repeated the same dimensionality reduction
(LSI/UMAP) process on the scATAC-seq data with the same settings. We
then performed a second round of label transfer, using the ArchR
addGeneIntegrationMatrix function (same parameters as described above
for level 1), to reach level 2 and 3 cell labelings of the Seurat PBMC
reference dataset. These labels were consolidated into 25 cell types
for most analysis, except for the co-accessibility analysis where 17
cell types were used to match the published promoter-capture HiC
resource^[279]54. The median cell labeling score across all cells that
passed quality control was 0.74.
Three Human scATAC-seq datasets for MOCHA development and benchmarking
COVID19 dataset
Two samples (1 COVID- sample, male; 1 COVID+ sample, female, collected
on day 12 PSO) from our longitudinal COVID-19 cohort were lost due to
low sample volume. The scATAC-seq data of the remaining samples was
denoted as the COVID19 dataset (n = 91) in this study. After removing
doublets and cells of poor quality, high quality data of 1,311,638
cells were obtained. The data was split into two overlapping subsets
for some analyses: 1) A cross-sectional dataset (denoted as COVID19X,
n = 39) included data of COVID- samples (n = 22, 10 females and 12
males) and the first samples of COVID+ participants (n = 17, 9 females
and 8 males) during early infection (<16 days PSO). 2) A longitudinal
dataset (denoted as COVID19L, n = 69) included all data for the 18
COVID+ participants (10 females and 8 males). The overlap between
COVID19X and COVID19L was 17, which were the first samples of COVID+
participants. The full dataset (COVID19) can be accessed at GEO under
accession number [280]GSE173590.
HealthyDonor dataset
This longitudinal scATAC-seq dataset^[281]44 was collected on 18 PBMC
samples of 4 healthy donors (aged 29–39 years) over 6 weeks (1 female
and 1 male, weeks 2–7; 2 males, weeks 2, 4, and 7). The donors had no
diagnosis of active or chronic disease during the study. The data is
publicly available at GEO under accession number [282]GSE190992. We
used the dataset as is, except we removed cells with doublet enrichment
score exceeding 1.3, based on ArchR’s doublet detection
algorithm^[283]11. High quality data of 145,711 cells were obtained.
From this dataset, we consolidated existing annotations into 25 cell
types with a published median cell labeling score of 0.78.
Hematopoiesis dataset
This dataset was downloaded from
([284]https://www.dropbox.com/s/sijf2votfej629t/Save-Large-Heme-ArchRPr
oject.tar.gz). It consists of ~220,000 hematopoietic cells from the
hematopoiesis dataset in ArchR^[285]11. As described in their
Supplementary Table [286]1, the dataset was assembled from 49 samples
in four data sources, of different sample types (mixed, sorted, and
unsorted cells; PBMCs; and bone marrow mononuclear cells), and
generated using different sample processing protocols and on different
technical platforms. We used ArchR to generate doublet scores and
removed clusters with both high doublet scores and a mixture of
disparate cell types. This doublet removal only applied to sequencing
wells that were not sorted, purified cells. In the end, data of 95,599
cells were obtained. Because many of the cell types were sorted
populations run on an individual well, many cell types were not
available across samples. As a result, we treated all cell types as
coming from a single sample for benchmarking purposes. We used
previously published cell annotations, with a median labeling score of
0.70.
Mouse Pan-Organ scATAC Atlas
This dataset (GEO: [287]GSE111586) was downloaded from
[288]https://atlas.gs.washington.edu/mouse-atac/. We performed doublet
removal and then used the data for benchmarking purposes on non-human
scATAC-seq datasets. The original cell type labels were used in the
analysis.
Assessing dataset noise using the Altius Peakset
To assess ‘noise’ within a dataset (i.e., fragments from closed regions
known as heterochromatin), we used the Altius consensus
peakset^[289]100 of over 3.6 million DNase I hypersensitive sites
within the human genome as an approximation of all potentially
accessible sites. We calculated the overlap rate between fragments and
the Altius consensus peakset for each cell type and dataset, in order
to assess the quality of the data. The COVID19 dataset had an median
Altius peakset overlap rate of 88.9%, while the corresponding rates for
the HealthyDonor dataset and the Hematopoiesis dataset were 75.5% and
82%, respectively.
MOCHA overview
MOCHA is implemented as an open-source R package under the GPLv3
license in [290]CRAN. All code and development versions of MOCHA are
available in [291]MOCHA’s github repository.
MOCHA is designed to run in-memory and interoperate with common
Bioconductor methods and classes (e.g., RaggedExperiment,
MultiAssayExperiment, and Summarized Experiment). It takes as input
four objects that are commonly generated from scATAC-seq after cell
labeling and the removal of doublets and cells of low quality data.
These four objects are: 1) a list of GRanges or GRangesList containing
per-sample ATAC fragments, 2) cell metadata with cell labels, 3) a
BSGenome annotation object for the organism, and 4) a GRanges
containing blacklisted regions. These inputs can be passed to MOCHA
directly from an ArchR object. Alternatively, results can be extracted
from Signac, SnapATAC, or ArchR, and converted to common Bioconductor
data objects, which can then be imported into MOCHA. By operating on
well-supported Bioconductor objects, MOCHA’s inputs and outputs are
compatible with the broader R ecosystem for sequencing analyses, and
are easily exportable to genomic file formats such as BED and BAM.
MOCHA’s core functionality runs as a pipeline from these inputs to
perform sample-specific open tile prediction and consensus analysis,
resulting in a TSAM represented as a Bioconductor
RangedSummarizedExperiment. On systems with sufficient memory, MOCHA’s
functions can be parallelized over samples with the ‘numCores’
parameter to decrease runtime. From the TSAM, MOCHA provides functions
for ZI co-accessibility and ZI differential accessibility analysis. The
format of the TSAM output enables additional downstream analyses with
other R packages. For example, the TSAM RangedSummarizedExperiment can
be used directly as the counts matrix input for motif deviations
analysis with the chromVAR R package. Additional details on the
workflow and functions on the MOCHA package are provided in
Supplementary Fig. [292]1.
MOCHA objects
MOCHA’s workflow generates two major objects: an OpenTiles Object, and
a Tile-Sample Accessibility Matrix (TSAM) object. An OpenTiles Object
is created by callOpenTiles and stores tile accessibility information
at the sample and cell type level, along with sample-specific metadata
in a MultiAssay Experiment, which contains:
* RaggedExperiment (Bioconductor) for each cell type
+ ◦
Accessibility calls and normalized intensity across samples
* Sample-level metadata in the colData slot.
* Additional metadata as a list in the metaData slot.
+ ◦
Cell type-level metadata in a SummarizedExperiment
(Bioconductor)
o ▪
Cell counts, fragment number, and other relevant metadata
+ ◦
Transcript and Genome database used for calling
+ ◦
History of the object
From there, user-defined cutoffs are used to identify study-level open
chromatin regions. These regions are used to generate tile-by-sample
accessibility matrices (i.e., TSAMs) for downstream analysis. These
TSAMs are saved in a SummarizedExperiment format, containing:
* Tile-Sample Accessibility Matrix for each cell type
+ ◦
Counts at the complete set of open regions (across all cell
types)
* Tile metadata saved in the rowRanges slot
+ ◦
Genomic positions for each tile
+ ◦
Boolean/indicator columns labeling population-level open tiles
for each cell type
+ ◦
Columns labeling the type of region (e.g., promoter,
intergenic, distal)
+ ◦
Columns labeling the associated genes for promoter and
intergenic tiles
* Sample-level metadata in the colData slot.
* Additional metadata as a list in the metaData slot.
+ ◦
Cell type-level metadata in a SummarizedExperiment
(Bioconductor)
o ▪
Cell counts, fragment number, and other relevant metadata
+ ◦
Transcript and Genome database used for calling
+ ◦
History of the object
These TSAM can then be integrated with ArchR, ChromVar, and other
Bioconductor-compatible R packages for downstream analyses.
Tiling the genome
MOCHA splits the genome into pre-defined, non-overlapping 500 base-pair
tiles that remain invariant across samples and cell types. MOCHA
annotates each tile using a user-provided transcript database (e.g.,
HG38 Transcript Database) as follows: Promoter regions are 2000 bp
upstream and 200 bp downstream from transcriptional start
sites^[293]104. Intragenic regions are tiles that fall within a gene
body, but not within the promoter regions. All other regions are
classified as distal.
From there, we only consider tiles that overlap with ATAC fragments.
MOCHA counts the number of fragments in a tile as follows
[MATH:
fi,j,<
/mo>k,t=thenumberoffragmentsonsamplei,ofcelltypej,incellk,overlappingwithtilet :MATH]
1
[MATH: Fi,
j=thetotalnumberoffragmentsonsamplei,ofcelltypej :MATH]
2
If a fragment falls between two tiles, it is counted on both tiles.
Normalization
Normalization techniques using invariant CTCF sites
We examined three normalizing approaches: dividing the number of
fragments by 1) the total number of fragments for sample i, cell type j
(i.e.,
[MATH:
Fi,j :MATH]
); 2) the total number of fragments for sample i (i.e.,
[MATH:
Fi=ΣjFi,j
:MATH]
), and 3) the total number of cells in sample i, cell type j. We
evaluated the above normalization methods along with the raw data based
on a list of 2230 cell-type invariant CCCTC-binding factor (CTCF) sites
from the ChIP Atlas database^[294]105. These loci were identified in at
least 201/204 (99%) of blood cell types present in the ChIP-seq Atlas
database^[295]106. Using these CTCF sites, each approach was assessed
based on the corresponding distribution of coefficient of variation
(CV) in peak accessibility. MOCHA normalizes data using
[MATH:
Fi,j :MATH]
.
Sample- and cell type-specific normalization
For each sample i, cell type j, and tile t, MOCHA calculates the
following normalized features:
[MATH: λ(1)i,j,t=Σkfi,j,k,t/Fi,j
mrow>×10<
mn>9=thetotalnormalizedfragmentsforsamplei,celltypej,attilet :MATH]
3
[MATH: λ(2)i,j,t=(max{f<
/mrow>i,j,k,t}k/Fi<
mo>,j)×
109=<
/mo>themaximumnumberofnormalizedfragmentsacrosssinglecells,forsamplei,celltypej,tilet :MATH]
4
Since the NK population used for model training contained 750 million
fragments, a scaling factor of
[MATH: 109
:MATH]
is applied to make the raw and normalized counts on the same scale
across cellular abundances, and keep normalized values greater than 1
to minimize convergence errors in downstream model training.
Biologically, λ^(1)[i,j,t] is designed to capture the total number of
fragments across all cells (e.g., pseudo-bulk), normalized by the
sequencing depth for that cell type and sample. Given the sparsity of
scATAC-seq data and the assumption of limited number of genomic copies
(2x-4x) in a typical cell, λ^(2)[i,j,t] is designed to capture the
presence of multiple fragments in a tile from any cell, which can only
be evaluated on single-cell data. This approach combines single cell
and pseudo-bulk information for downstream prediction. Normalizing by
[MATH:
Fi,j :MATH]
is used to normalize both sequencing depth and cell population
variability. This approach provides both a sample- and
cell-type-specific normalization scheme.
Evaluation of open chromatin accessibility
Training of logistic regression models for predicting tile accessibility
MOCHA assumes a typical ploidy per cell (two to four copies of the
genome). Its pseudocode and further details are provided in
Supplementary Fig. [296]2 in order to allow for modifications when the
above assumption does not hold.
We used scATAC-seq data of 179k NK cells in the COVID19 (n = 91)
dataset as the training dataset. First, we normalized the scATAC-seq
data and collapsed it into pseudobulk data. Second, we applied
MACS2^[297]40 (‘-g hs -f BED --nolambda --shift -75 --extsize 150
--broad’, ‘--model -n’, using the parameters set in ArchR) to identify
accessible peaks in the pseudobulk data. We chose these settings to
match the behavior of MACS2 in a commonly-used scATAC data analysis
software, modified to call broad rather than narrow peaks. The
resulting peaks were then overlaid onto our pre-defined 500 bp tiles.
We trim the broad peaks by 75 base pairs at each end and remove the
tail ends of peaks that extend onto tiles with no signal. MACS2
identified 1.15 million tiles as ‘accessible regions’ and the remaining
4.39 million tiles with fragments were labeled as ‘inaccessible’. We
labeled all other fragment-containing regions as inaccessible, and used
these ‘accessible’ and ‘inaccessible’ regions for training. Third, we
randomly selected NK cells at cell counts ranging from 170k to 5 at
discrete intervals, generating 10 replicates for subsets <50k cells,
and 5 replicates for larger subsets. In each of the subsets, we
calculated λ^(1)[i,j,t] and λ^(2)[i,j,t] at individual tiles. Fourth,
we trained a LRM for each selected subset of NK cells based on the tile
labeling just described. The LRM was fitted using a logit ‘link’
function and applying the default loss function from the GLM logistic
regression function in R. For each sample of
[MATH: na
:MATH]
cells, the LRM calculates a probability score to assess the likelihood
of a tile being accessible, using the formula
[MATH:
Pr(<
mi>na)<
/mrow>(λ
mrow>(1)i,j,t,λ(2)i,j,t
)=1
1+exp(−<
msubsup>b0(
na<
/mrow>)−b<
/mi>1(na)*λ
(1)
msup>i,j,t−b2
mrow>(na)*λ(<
mn>2)i<
mo>,j,t)<
/mrow> :MATH]
5
Here
[MATH:
b0
(na
) :MATH]
is the intercept,
[MATH:
b1
(na
) :MATH]
and
[MATH:
b2
(na
) :MATH]
are coefficients for λ^(1)[i,j,t] and λ^(2)[i,j,t], respectively. A
tile is predicted as accessible if
[MATH:
Pr(<
mi>na)<
/mrow>≥θ(
mo>na) :MATH]
or inaccessible if
[MATH:
Pr(<
mi>na)<
/mrow><θ(<
/mo>na
mrow>) :MATH]
where
[MATH:
θ(na)
mrow> :MATH]
is the threshold value separating accessible and inaccessible tiles.
Since it is not possible to balance open vs closed tiles in real-data,
this leads to prevalence imbalances that need to be addressed. To
account for this, we used the Youden index^[298]107 to calculate
optimal
[MATH:
θ(na)
mrow> :MATH]
in this study, using the cutpointR R package.
Fifth, we collected
[MATH:
{b0<
/mrow>(na)<
mo>} :MATH]
,
[MATH:
{b1<
/mrow>(na)<
mo>} :MATH]
,
[MATH:
{b2<
/mrow>(na)<
mo>},{θ<
/mrow>(na)
} :MATH]
from the 10 or 5 replicated runs on n cells and then took the
corresponding median coefficients, i.e.,
[MATH:
b0
(n) :MATH]
,
[MATH:
b1
(n) :MATH]
,
[MATH:
b2
(n) :MATH]
, and
[MATH:
θ(n) :MATH]
, to construct the predictive model for a sample of n cells. Finally,
we used the learned coefficients and the learned thresholds to smoothen
the model to interpolate the model across cellular abundances that the
model was not trained on. The final model is composed of a set of
smoothened coefficients
[MATH:
{b0<
/mrow>(n)} :MATH]
,
[MATH:
{b1<
/mrow>(n)} :MATH]
,
[MATH:
{b2<
/mrow>(n)} :MATH]
, and smoothened thresholds
[MATH:
{θ(n)} :MATH]
from all examined
[MATH: n :MATH]
.
Model evaluation metrics
To quantify a model’s performance, we counted their true positives
(TPs), false positives (FPs), true negatives (TNs), and false negatives
(FNs), and used these to calculate false positive rates, recall, and
F1-score.
[MATH: FalsePositiveRate=1−Precision=1−TP/(TP+FP),Recall=Sensitivity=1−FalseNegativeRate=TP/(TP+FN),F1score=2*TP/(2<
mo>*TP+FP+FN).Specificity=1−FPRYouden’sJ=Sensitivity+Specificity−1 :MATH]
Prediction of tile accessibility on new data
To predict accessibility in a new dataset, MOCHA first accounts for
differences in sequencing depth and cell count across datasets and
calculates the ratio, S, of the median (across samples) number of total
fragments in the training data, and the corresponding median in the new
dataset. MOCHA then scales both λ^(1)[i,j,t] and λ^(2)[i,j,t] in the
new dataset by S and calculates the likelihood of a tile being
accessible as
[MATH:
Pr(n)(λ(1)i,j<
/mi>,t,λi,j,t(2)<
/msubsup>,S)=<
mrow>11+exp[
−b0
mrow>(n)−b1
(n)
mrow>*(Sλi,j,t
mi>(1))−
b2(n
mi>)*(<
mrow>Sλ(2)i,j,t)],
:MATH]
6
where
[MATH: n :MATH]
is the number of cells of the targeted cell type in the targeted
sample. The normalization and scaling factors enable the same
smoothened coefficients to be used on new data, regardless of cell type
or tile position. As before, a tile is predicted as accessible if
[MATH:
Pr(n)≥θ(n) :MATH]
or inaccessible if
[MATH:
Pr(n)<θ(<
/mo>n) :MATH]
where
[MATH:
θ(n) :MATH]
is the threshold value separating accessible and inaccessible tiles.
Benchmarking open regions
To benchmark MACS2, HOMER, and MOCHA, we ran each tool per sample and
cell type to generate comparable accessibility measurements across
three cell types in three different datasets. For MACS2, we used the
following parameters to call broad peaks (‘-g hs -f BED --nolambda
--shift -75 --extsize 150 --broad --nomodel -n’), in accordance with
previous published scATAC-seq settings^[299]11. For HOMER, we used the
findPeaks function with default parameters, and added (‘-style
histone’) to call broad peaks. While HOMER and MACS2 are primarily
designed around the properties of ChIP-seq and DNase-seq, they are also
recommended for use with bulk ATAC-seq^[300]41.
To ensure head-to-head comparisons, we overlaid HOMER and MACS2’s peaks
into MOCHA’s predefined 500 base-pair tiles to translate peak calls
into open tile calls. Similar to training, we trimmed 75 bp off each
end, as MACS2’s shift/extsize parameters extend fragments to improve
peak calling under the -nomodel flag. By trimming, we avoid counting
the tails of peaks that extend into tiles with no actual fragments.
This trimming approach allows for a direct head-to-head comparison of
open regions detected across methods. Additionally, all three methods
are provided the same normalized pseudo-bulk intensity information to
ensure comparable peak calling and prevent confounding peak calling and
normalization. After translating MACS2 and HOMER peaks into open tiles,
we then compared the number of open tiles per sample across all
methods, cell types, and datasets.
Next, we generated a TSAM for each cell type across all three methods.
The TSAM is a matrix with an array-type structure, where each cell
contains the normalized λ^(1)[i,j] intensities for a given sample i, at
tile j. We kept open tiles that were called in at least 20% of samples
(or all tiles in Hematopoiesis). By generating a TSAM for each method,
we compared reproducible, population-level open tiles across all three
methods. The 20% threshold was applied to filter out noisy data.
CTCF and TSS Sites for benchmarking
CTCF sites were drawn from the ChipSet Atlas^[301]105. In brief, we
download a bed file containing CTCF peaks for all blood cell types, and
then used Plyranges’s reduce_ranges^[302]108 function to collapse
duplicate peak calls into one non-redundant and smaller file for
detecting overlaps. This process was done for both Hg19 (n = 197,882)
and Hg38 (n = 184,588). TSS sites were taken from Bioconductor database
TxDb.Hsapiens.UCSC.hg19.knownGene for the Hematopoiesis dataset (which
was aligned to Hg19), and TxDb.Hsapiens.UCSC.hg38.refGene for the other
datasets by first extracting the transcripts for all genes. The TSS
were then extracted from the transcripts using the promoters() command
(Hg19, n = 62,265, Hg38, n = 88,819). We then calculated the number of
tiles that overlapped with a CTCF and TSS site using the
subsetByOverlaps function from the GenomicRanges^[303]104 R package.
Runtime comparison on open chromatin analysis
Using the CD14 monocyte population in our COVID19 dataset (n = 91), we
produced 13 subsamples ranging from 100,000 to 10 cells and measured 10
replicates of the time it takes to conduct open chromatin analysis. Our
runtime comparisons were conducted on N2 machines on the Google Cloud
Platform, with 64 vCPUs and 512GB RAM. MOCHA version 0.2.0 was used.
The R package tictoc was used to record elapsed time.
Downsampling comparison on open chromatin
Since pooling cells across samples before calling open tiles is a
common approach, we benchmarked all three methods on the same randomly
selected cell subsets ranging from 5 cells to the full set in the
COVID19 dataset (n = 91). For this comparison, we utilized the same
three cell types across the same three datasets. For each cell type and
dataset, we used the following procedures. For MOCHA:
1. Generate a coverage object using predefined 500 base-pair tiles on
all pooled cells (e.g., CD16 monocytes).
2. Predict open tiles on the pooled cells.
3. Count the total number of open tiles, the number of open tiles
overlapping with CTCF sites, and the number of open tiles
overlapping with TSSs.
4. Repeat (1–3) for all pre-specified downsampled cell counts.
For MACS2 and HOMER:
1. Generate a coverage file on all pooled cells (e.g., CD16
monocytes),
2. Call peaks on the pooled cells.
3. Convert the peak regions onto the pre-defined MOCHA tiles.
4. Remove tiles that extend onto empty regions.
5. Count the total number of open tiles, the number of open tiles
overlapping with CTCF sites, and the number of open tiles overlapping
with TSSs.
6. Repeated (1–5) for the pre-specified downsampled cell counts.
Simulating fragments and open chromatin from scATAC-seq
To benchmark HOMER, MACS2, and MOCHA, we designed a simulation study to
determine the sensitivity, specificity, and accuracy of each method.
The simulated scATAC-seq data is constructed in 6 steps as follows:
Step 1: Defining open and closed tiles as the ground truth. We first
splitted each chromosome into 500 bp tiles. We then assigned roughly
200,000 open tiles across the genome, with the number of open tiles per
chromosome being proportional to the chromosome’s length. Afterwards,
we distributed the open tiles along each chromosome with a sinusoidal
probability (negative values set to 0) with a periodicity of 10 tiles,
to mimic periodic appearances of open regions followed by large closed
regions. All other tiles were defined to be closed. We treated these
open and closed tiles as the ground truth in this simulation analysis.
In this step, each sampling draws a different peakset and thus mimics a
cell type with its own peakset.
Step 2: Simulating the number of fragments in individual cells. Each
simulated cell was randomly allocated a mean of 4k fragments, following
a Poisson distribution (
[MATH: λ :MATH]
= 4000). To simulate our quality control (QC) thresholds for
sequencing depth, we excluded cells with fewer than 1000 fragments. The
process was repeated until a target number of simulated cells was
reached.
Step 3: Assigning fragments to open and closed tiles within each cell.
Simulating a FRIP (fraction of reads in peaks) score of 95%, a random
number of fragments were assigned to the open tiles using a Poisson
distribution (
[MATH: λ :MATH]
= 0.95*fragment number) while the rest fragments were randomly
assigned to the closed tiles with equal weight. For the open tiles, we
first simulated their relative weight using a Beta distribution (1,3)
and then allocated the corresponding fragments to individual open tiles
according to their relative weights. These weights were set to simulate
real data of many low accessible tiles and few highly accessible ones.
Step 4: Positioning fragments in individual tiles. To determine the
precise location of fragments assigned to open tiles, we first assigned
the midpoint of each fragment to the center of the corresponding tile
and then used a Poisson distribution (
[MATH: λ :MATH]
=5) to add a random offset from the center. The direction of the offset
(left or right) was randomly generated with a Bernoulli distribution
(p = 0.5). For fragments assigned to closed tiles, their position was
randomly distributed across ‘closed’ tiles.
Step 5: Determining the length of individual fragments. We simulated
the length of individual fragments following a Poisson mixture
distribution so that approximately 90% of fragments have a mean length
of 75 bp, and 10% with a mean length of 200 bp.
Step 6: Assembling simulated scATAC-seq data. All simulated fragments
from all simulated cells were treated as a scATAC-seq sample, saved in
a single fragment file, and converted into a normalized bedgraph file
for peak calling.
Using the above process, we ran two sets of simulations: 1) varying the
number of cells to model cell types of different abundances, and 2)
varying the total number of open regions to mimic cell types of
different open tiles. For the first set, we simulated samples with a
range of cell counts: 75, 100, 150, 200, 250, 500, 1000, 1500, 2000,
3000, and 5000 cells. Tiles having no fragments were removed from
individual samples, resulting in different ground truth tiles per
sample. This process was repeated 10 times for each cell count. The
pairwise overlap rate between the 10 peaksets sampled in Step 1 ranged
between 8.4% to 8.6%. 110 simulated samples were generated.
For the second set of simulations, we fixed the number of cells (250)
but varied the number of fragments per cell (1k, 1.5k, 2k, 2.5k, 3k,
3.5k, 4k, 4.5k) and the location and number of open regions (150k,
250k, 350k). The three peaksets sampled in Step 1 overlapped at rates
between 6.3% to 14.8%. For each peakset, 10 samples were simulated at
each value for fragments per cell, generating a total of 240 simulated
samples. This enables the simulation of different cell types (total
open regions and/or fragment variations due to ploidy variation) and
technical influences from sequencing depth (fragments per cells). Based
on existing vignettes and documentation, we note different tools apply
different QC cutoffs to filter out low quality cells: ArchR=1k,
Signac=1k, SnapATAC=3k. To cover a wide range of conditions, we
simulated fragments per cell from 4.5k (our internal datasets) down to
1k, where ArchR and Signac would retain approximately 50% of cells.
We then ran all three methods (MOCHA, MACS2, & HOMER) to identify open
tiles, in the following manner. For MACS2 and HOMER, we called peaks
and converted the open peakset into 500 bp tiles for direct comparison
across methods, and removed tiles that minimally overlap the peakset
(<75 bp overlap) to avoid tiling artifacts. Finally, we compared the
open tiles from each method with the ground truth using previously
described model evaluation metrics.
Assessment on zero-inflation in pseudobulk scATAC-seq Data
To assess ZI in pseudobulk scATAC-seq data, we fitted a negative
binomial (NB) distribution on pseudobulked accessibility λ[j]^(1) at
each open tile. We then tested whether λ[j]^(1) was ZI using the DHARMa
R package, which utilizes the model fits from the glmmTMB R package.
More specifically, the test compared the observed number of zeros with
that expected from a NB distribution: An estimate of >1 means that
there are more zeros than expected by a NB model and a p < 0.05 means
that the observed and the expected zeros in the data is significantly
different. To be comprehensive, we ran the test using two standard
parameterizations of the NB family, as implemented in the glmmTMB
package: The variance grows either linearly (NB1) or quadratically
(NB2) with the mean. The λ[j]^(1) was ZI if p < 0.05 and statistic > 1,
or underinflated if p < 0.05 and statistic <1.
Differential accessibility analysis
MOCHA’s zero-inflated method for DAA
MOCHA identifies differential accessibility tiles (DATs) in a targeted
cell type between sample groups A and B in three steps:
First, similar to others^[304]10,[305]11, MOCHA prioritizes tiles for
testing using heuristic functions to calculate two data-driven
thresholds. MOCHA transforms the total fragment count {λ^(1)} in the
corresponding TSAM to log[2](λ^(1) + 1) and fits a mixture model of two
normal distributions on all log[2](λ^(1) + 1) values in the TSAM
(Supplementary Fig. [306]2g). This bimodal model provides a heuristic
threshold to prioritize high-signal tiles. From there, we used the TSAM
metadata to identify any differences in sequencing depth by comparing
the median number of fragments per sample between groups. This analysis
informs the ZI threshold. Given our initial observations of a 25%
difference in fragment counts, we set a 50% threshold (2X the observed
sequencing depth difference) to control for technical artifacts. Tiles
that do not pass either threshold are assigned a DAT P value of NA, and
those passing thresholds are then tested for differential
accessibility.
Second, MOCHA tests for differential accessibility as follows. Denote
the percentages of zeroes among samples of the two groups as ρ[A] and
ρ[B] and the corresponding medians of non-zero log[2](λ[j]^(1) + 1)
values as μ[A] and μ[B]. MOCHA then tests whether a tile is a DAT based
on the following hypothesis testing:
Null hypothesis (
[MATH: H0
:MATH]
): ρ[A] = ρ[B] and μ[A] = μ[B]
Alternative hypothesis (
[MATH: H1
:MATH]
): ρ[A] ≠ ρ[B] or μ[A] ≠ μ[B].
MOCHA uses the two-part Wilcoxon (TP-W) test^[307]35 to combine results
from the binomial test on ρ[A] and ρ[B] with results from the Wilcoxon
rank-sum test on μ[A] and μ[B]. Since each test statistic can be
transformed to follow a
[MATH:
χ12
:MATH]
distribution (i.e.,
[MATH:
χ1,ρ2 :MATH]
and
[MATH:
χ1,μ2 :MATH]
), MOCHA combines them into a single test statistic, i.e.,
[MATH:
χ22
:MATH]
=
[MATH:
χ1,ρ2 :MATH]
+
[MATH:
χ1,μ2 :MATH]
, and consequently evaluates from it a single P value^[308]34. In the
absence of zeros, the TP-W test mathematically reduces to the standard
Wilcoxon rank-sum test.
Finally, to control for multiple testing, MOCHA calculates a modified
q-value to adjust for False Discovery^[309]109 for each tile and uses a
default threshold of 0.2 to identify DATs. Since the P values are
inflated near 1 (see Supplementary Fig. [310]10a), the background is
estimated from P ≤ 0.95 only.
In addition, MOCHA uses the Hodges-Lehmann estimator^[311]110 to
estimate log[2](fold change) on chromatin accessibility between the two
sample groups. More specifically, MOCHA first calculates the difference
between each sample pair (one sample each from group A or B) having
non-zero log[2](λ^(1) + 1) values and then takes the median from all
paired differences as an estimate for log[2](fold change) between the
two sample groups.
Benchmarking MOCHA with ArchR, Signac, DESeq2, and DiffChipL on DAA
ArchR’s, Signac’s, DESeq2’s, and DiffChipL’s DA modules were each run
on a single cell count matrix generated from the same tile set (215,649
tiles) as the COVID19X CD16 monocytes TSAM. For ArchR, default settings
were used, except we modified maxCells to include all cells
(n = 24,744). For Signac, we lowered the minimum percent detection (pct
= 0.001), and the log[2]FC threshold (logfc.threshold = 0.05) in order
to test the full tileset, thus enabling a full head-to-head comparison.
As a close analog of Signac’s tutorial, we also set latent.vars to
‘nFrags’ to adjust for sequencing depth. For DESeq2 and DiffChipL, the
single cell matrix was pseudobulked by sample to generate raw counts,
as required for DESeq2’s and DiffChipL’s workflow before running
differentials. Default settings for normalization and differentials
were used, in line with both method’s tutorials.
To assess the false positive rate (FPR, 1 - specificity) of the
methods, we conducted 50 permutation tests in which labels for COVID+
and COVID- samples were randomized and significant DATs were identified
as in the real data in each test. All DATs thus identified were
considered as false positives (FPs) and the FPR was evaluated as the
ratio of the number of FPs to the number of the original DATs from the
real data.
Due to a lack of “ground truth”, it is not possible to accurately
evaluate the recall (sensitivity) of the methods. Nevertheless, we
estimated a lower bound of the recall by downsampling the samples from
n = 39 to n = 38, 37, …, 30. More specifically, we randomly selected
the targeted number of samples, identified the DATs, and estimated the
recall as the ratio of the number of the new DATs to the number of the
original DATs. We repeated the process 15 times for each targeted
sample size. To save computing time, we limited the analysis to tiles
in chromosome 4 only. Some of the false negatives may be specific to
removed samples, due to disease/human heterogeneity. The loss of power
due to sample size reduction also increases the number of false
negatives. Thus the recall evaluated in the approach is likely a lower
bound.
Power analysis when downsampling
We conducted a power analysis to illustrate the theoretical power loss
when reducing the sample size from n = 39 to n = 30. We first
calculated the statistical power of detecting differences with a
moderate effect size (d = 0.5) under an
[MATH: α :MATH]
=0.05 using a 2-sample t-test (COVID+, n = 17; COVID-, n = 22). We then
downsampled from n = 39 to n = 30 and calculated the power with d = 0.5
and
[MATH: α :MATH]
=0.05 while maintaining the COVID+/COVID- sample ratio as close to
17/22 as possible. Finally, we divided the statistical power at each n,
with the power observed at n = 39, to calculate relative power loss due
to downsampling.
Assessing discriminative power per method
We randomly subsampled 50 DATs from the output of each method, ran
K-means clustering (K = 2), and generated the following confusion
matrix to summarize the predictions.
COVID+ COVID−
Cluster 1 a b a + b
Cluster 2 c d c + d
a + c b + d a + b + c + d
[312]Open in a new tab
We then calculated Holley’s^[313]50
[MATH:
G=(a+<
/mo>d)−(<
mi>b+c)a
+b+c+d :MATH]
to assess how well the 50 randomly selected DATs in separating COVID+
and COVID- samples. We used |G| for the comparison since it is
irrelevant which cluster is enriched for COVID+ samples. We repeated
this process 1,000 times to obtain a distribution for each method.
DAA runtime comparison
To evaluate each method’s speed in DAA, we started by testing all
215,649 tiles in the CD16 monocytes TSAM, and gradually decreased the
number of tested tiles. At each downsample, we tracked the run time
required to identify DATs within those randomly selected tiles. The
tictoc R package was used to calculate runtime.
Co-accessibility analysis
MOCHA’s zero-inflated method for CAA
MOCHA applies ZI Spearman correlation^[314]36,[315]37 to evaluate the
co-accessibility of two tiles (e.g., X and Y) across either cell types
or samples based on the corresponding log[2](λ^(1) + 1) values. More
specifically, the method first calculates the standard Spearman
correlation on (X,Y) pairs of non-zero data (i.e., X > 0 and Y > 0),
denoted as
[MATH:
ρS,11 :MATH]
, and then adjusts it in the presence of zeroes in either tile as
follows:
[MATH:
ρS*
=p11p+1
p1+
mrow>ρS,
11+3(p00p
mi>11−p10p01), :MATH]
7
where
[MATH:
p00=<
mi mathvariant="normal">P(X=0,Y=0), :MATH]
[MATH:
p10=<
mi mathvariant="normal">P(X>0,Y=0), :MATH]
[MATH:
p01=<
mi mathvariant="normal">PX=0,Y>0, :MATH]
[MATH:
p11=<
mi mathvariant="normal">P(X>0,Y>0),
:MATH]
[MATH:
p+1=p01+p11
, :MATH]
[MATH:
p1+=p10+p11
, :MATH]
which quantify how zeros are distributed among the two tiles across all
data points with
[MATH:
p00+<
msub>p10+p01+p11=
1 :MATH]
. In the absence of zeros, the ZI-Spearman correlation reduces to the
standard Spearman correlation, i.e.,
[MATH:
ρS*
=ρS,11 :MATH]
. MOCHA makes two modifications to an R implementation of the
method^[316]28: 1) The Spearman correlation (
[MATH:
ρS,11 :MATH]
) is calculated in C language for optimal computing time and 2)
undefined ZI Spearman correlations (when
[MATH:
ρS,11 :MATH]
cannot be calculated) are assigned to NA rather than replacing them
with the standard Spearman correlations with zeros treated as normal
data.
Benchmarking inter-cell-type co-accessibility
We used a previously published promoter-capture HiC (pcHiC)
resource^[317]54 which identified promoter-enhancer regulatory links.
From there, we used the liftOver R package, version 1.22.0, and the
Hg19 to Hg38 conversion file (hg19ToHg38.over.chain,
[318]https://hgdownload.soe.ucsc.edu/gbdb/hg19/liftOver/) to convert
promoter/enhancer loci from HG19 to HG38. The large HiC windows for
Promoters and enhancers were then tiled into 500 bp windows to generate
all potential promoter-enhancer tile (PET) pairs. We only kept PET
pairs when both tiles were identified as accessible by MOCHA in naive
CD4+ and CD8+ T Cells and had pcHiC evidence supporting their
interaction specifically in naive CD4+ and CD8+ T cells. We used this
final list of 1.2 million PET pairs as a foreground set of regions that
are enriched for biologically relevant interactions within these cell
types. We then generated a background set of randomly selected pairs of
tiles to generate an empirical null distribution. We calculated the
Spearman and ZI-spearman correlations across 17 cell types to identify
pairs of regions that are co-accessible in T cells vs other cell types
(17 cell types include B intermediate, B memory, B naïve, CD14 Mono,
CD16 Mono, CD4 Effector, CD4 Naïve, CD8 Effector, CD8 Naïve, DC, HSPC,
MAIT, NK, NK Proliferating, NK_CD56bright, OtherT, and Treg). We then
used the Kolmogorov–Smirnov (KS) distance to quantify how well the two
correlation methods separated the PET pairs from the random pairs. We
then calculated an empirical P value for each PET pair based on the
foreground and the empirical null distributions, and then accounted for
multiple comparisons (Benjamini-Hochberg method). A PET pair was
considered as significant if FDR < 0.1.
Pathway enrichment analysis
Pathway enrichment analysis was mostly restricted to the Reactome
pathway database. All genes within the database reference of
TxDb.Hsapiens.UCSC.hg38.refGene from Bioconductor^[319]111 were
selected as the background. Over-representation analysis was performed
using WebGestaltR^[320]112. We annotated enriched pathways at the
highest level within the Reactome’s database hierarchy. Lower level
annotations on immune system pathways were provided to discern
adaptive, innate, and general signaling pathways. Using WebGestaltR,
pathway enrichment analysis was performed once on Wikipathways, Gene
Ontology (Biological Processes, Non-redundant), and KEGG for
illustrative purposes.
Linkage disequilibrium score regression analysis
Linkage Disequilibrium Analysis was performed using the codebase from
[321]https://github.com/bulik/ldsc and reference files from
[322]https://zenodo.org/records/8292725. GWAS summary statistics were
pulled from EMBL’s GWAS catalog for studies GCST90029015^[323]113
(Autoimmune disease), GCST90038603^[324]114 (Immune Aging), and
GCST90043674^[325]115 (Abnormal Immune Response). Using the LDSC
scripts, munge_sumstats.py was used to generate summary statistics
files for heritability analysis. Peak calls on 9 cell types (across 3
datasets) by all 3 methods (total of 27 peaksets) from Fig. [326]2 were
split up by chromosome for computing efficiency and the make_annot.py
was used to generate an annotation file for each before modeling
(Source Data for Fig. [327]2). LDSC was then run (--l2 flag) on each
bed file of peak calls using reference files from above, generating a
file that could be used along with the reference baseline LD scores for
heritability analysis on each GWAS summary file. Overlap between
methods was then counted, when the same annotation was enriched in
different method’s peaksets from the same cell type, in the same
dataset, and using the same GWAS study’s results.
Identification of alternatively regulated transcription start sites
We extracted all TSSs from the Transcript database
TxDb.Hsapiens.UCSC.hg38.refGene found on BioConductor^[328]111, and
then expanded them upstream by 125 bp to account for TSSs falling very
close to a tile boundary. We filtered out genes with only one TSS. If
alternative TSSs of the same gene occurred within a user-defined
neighborhood (default: 150 bp) of each other, we collapsed them into a
single TSS. We then found the intersection between alternative TSSs and
the 6211 DATs between COVID+ and COVID- samples in CD16 monocytes. TSSs
that landed on a DAT were assigned with the FDR (q-value) of the
corresponding DATs. We categorized alternatively regulated genes (ARGs)
as
* Type I: A gene had a subset of TSSs showing differential
accessibility (FDR < 0.2) in the same direction and another subset
being open but not differential.
* Type II: A gene had at least two TSSs showing differential
accessibility (FDR < 0.2) but in opposite directions.
Motif enrichment analysis on alternatively regulated TSSs and associated
regulatory regions
Motif matching was done using the motifmatchr package and the CISBP
motif database, as provided by the chromVARmotif package
([329]https://github.com/GreenleafLab/chromVARmotifs). MOCHA uses a
standard hypergeometric test to identify enriched motifs, with a
user-provided foreground and background tile sets. For multi-testing
corrections, the resulting p-values were converted into FDRs
(Benjamini–Hochberg method). To understand the upstream signaling
mechanisms regulating ARGs, we first applied MOCHA to identify tiles
that were within ±1 M bp of and also co-accessible (inter-sample,
ZI-Spearman correlation > 0.5) with the corresponding DATs. These DATs
and their co-accessible tiles were selected as the foreground tile set.
For the background tile set, we chose all tiles with TSSs and their
co-accessible tiles that did not overlap with the foreground set. These
foreground and background tile sets were used to calculate CISBP motif
enrichment regulating ARGs.
Ligand-motif set enrichment analysis
The NicheNet^[330]56 database has identified links between upstream
ligands and downstream transcription factors (TFs) that regulate gene
expression^[331]56. Using the same principle as pathway enrichment
analysis, we designed a Ligand-Motif Set Enrichment Analysis (LMSEA)
framework to capture potential drivers of our observed motifs (i.e.,
ligands regulate the TFs in our dataset). Specifically, LMSEA tests
whether motifs linked to a ligand of interest are significantly (using
hypergeometric test) over-represented in our observed motifs relative
to the ligand’s motif set within NicheNet. The Benjamini and Hochberg
(BH) procedure was used to adjust P values for multiple comparisons. An
FDR value < 0.05 was considered significant.
Construction of ligand-transcription factor-gene network
We constructed ligand-TF-gene networks and visualized them using
Cytoscape^[332]116. The nodes were ARGs, enriched motifs (TFs), and
enriched ligands. Edges were drawn as follows: a motif-gene link was
created if an enriched TF was found within the TSS-containing DATs of
an ARG or their co-accessible tiles, a ligand-motif link was drawn if a
ligand was known to interact with a TF in NicheNet’s
ligand-transcription matrix.
Longitudinal analysis of COVID-19 response at single-cell level
Grouping COVID+ samples by infection stage
COVID+ samples (n = 69) in the COVID19 dataset were grouped by the
corresponding infection stage, including early infection (1–15 days
PSO, n = 21), late infection (16–30 days PSO, n = 13), and recovery
( > 30 days PSO, n = 35).
Generation of density UMAP
We extracted sample-specific open tiles on CD16 monocytes for all
samples in the full COVID19 dataset (n = 91). From there, we generated
a TSAM by aggregating all tiles that were called in at least 20% of
samples at any infection stage or uninfected. We extracted the tiles
from the resulting TSAM and added them to the original ArchR project
via addPeakSet. We then generated a single-cell peak matrix from this
tile set, using addPeakMatrix, and used it as input for ArchR’s
iterative LSI and UMAP functions. The LSI was run with default
parameters, except for the number of iterations (5 instead of 2). The
UMAP was run on standard ArchR settings^[333]11 on the resulting
iterative LSI object. Based on the resulting single-cell UMAPs, we
generated a density plot for each infection stage or uninfected.
Pseudotime trajectory analysis
We used ArchR’s standard Monocle3 pipeline to conduct a trajectory
analysis. We instructed Monocle to construct a trajectory from cells
belonging to samples in the order of early infection, late infection,
recovery, and uninfected. The resulting trajectory was overlaid on the
single-cell UMAP. Following the above trajectory, three distinct
pseudotime heatmaps were generated using ArchR’s standard protocol and
the following input single-cell matrices: log[2]-normalized GeneScores,
peak (tile) accessibility, and ChromVAR z-scores. Using ArchR’s
functions with default settings, we further extracted
pseudotime-changing elements for each of the three matrices.
Longitudinal analysis of COVID-19 response at pseudo-bulk level
Longitudinal analysis of motif usage
We modeled longitudinal motif usage using pseudobulk ChromVar motif
z-scores. We converted the TSAM of CD16 monocytes from the COVID19L
dataset (n = 69) into a ChromVAR-compatible object, and then ran
ChromVAR on the TSAM-derived object to generate sample-level motif
z-scores. We then modeled motif usage with the following generalized
linear mixed effect model (GLMM):
[MATH:
lmer(z~Age+Se
x+xi+xi<
/mi>2+
(1∣Subj
ect),data=
mo>mydata<
mo>) :MATH]
8
where
[MATH: xi
:MATH]
was the centered days PSO (i.e., days PSO of individual samples minus
the mean days PSO of all samples)^[334]117 and
[MATH:
(1∣Subject)
:MATH]
indicated that random intercepts were used for individual participants.
The P values associated with the linear
[MATH: xi
:MATH]
terms were extracted using the lmerTest package. We converted the P
values to FDRs (Benjamini-Hochberg method) to control multiple testing.
Motifs with a FDR < 0.1 were considered as significantly changing in
time.
Transcription factor network
The activator protein-1 (AP-1) family network was obtained by
subsetting the APID protein-protein interaction database^[335]118 down
to just the significant AP1-family TFs. Edges between nodes were
included if they were supported by at least four experiments. The nodes
were color-coded using the signs of the corresponding coefficient of
[MATH: xi
:MATH]
. The network was drawn using Cytoscape^[336]116.
Longitudinal analysis of gene promoter accessibility
We collected promoter tiles from the TSAM of CD16 monocytes in the
COVID19L dataset and modeled their accessibility using either GLMMs or
ZI-GLMMs with the glmmTMB package^[337]33. More specifically, for
promoters with zeroes, we applied the ZI-GLMM modeling as follows
[MATH:
glmmTMB<
mfenced
open="(">Log_Ac<
/mi>c~Age+S
mi>ex+Time+(1∣Su
mi>bject)<
/mrow>,zi=~0+CellCounts,data=m
ydata,famil
mi>y=gaussian()
, :MATH]
9
where
[MATH:
Log_Acc
:MATH]
was short for log[2](λ[j]^(1) + 1),
[MATH: Time :MATH]
was days PSO,
[MATH:
(1∣Subject)
:MATH]
indicated that random intercepts were used for individual participants,
and ZI was modeled as a function of the total cell counts in individual
samples with no intercept. For promoters without zeroes, we applied the
GLMM modeling as follows
[MATH:
glmmTMB<
mfenced
open="(">Log_Ac<
/mi>c~Age+S
mi>ex+Time+(1∣Su
mi>bject)<
/mrow>,zi=~0,
data=
mo>mydata,famil
mi>y=gaussian()
, :MATH]
10
where the ZI component was omitted. The P values associated with
[MATH: Time :MATH]
were extracted and converted to FDRs (Benjamini-Hochberg method) to
control multiple testing. Promoters with a FDR < 0.1 were considered as
significantly changing in time. For promoters attributed to multiple
genes, all genes were included for pathway enrichment and downstream
analyses.
Transcription factor and gene promoter associations
Linking transcription factor to gene promoters
We evaluated whether motif z-scores were statistically associated with
gene promoter accessibility via ZI-GLMMs as follows
[MATH:
glmmTMB<
mfenced
open="(">Log_Ac<
/mi>c~z,zi=
mo>~z,data=
mo>mydata,family
=gaussi<
/mi>an()<
/mrow>, :MATH]
11
where
[MATH:
Log_Acc
:MATH]
was short for log[2](λ[j]^(1)+1). All pairs of significantly changing
TFs and significantly changing gene promoters were evaluated. We
considered a TF and a gene promoter to be associated if the continuous
coefficient of
[MATH: z :MATH]
was statistically significant (P < 0.05) without adjusting for multiple
testing.
Linking transcription factor to innate immune pathways
Using the TF-gene promoter associations, we calculated the percentage
of significant genes in an innate immune pathway being associated with
a TF. For visualization purposes, the network only displayed an edge
between a TF and a pathway if more than 33% of significant genes in
that pathway were associated with the TF.
Variance decomposition on longitudinal COVID19 dataset
To identify significant covariates for modeling accessibility, we ran
variance decomposition on highly accessible tiles (defined as having at
least 80% non-zero values within any infection stage). We modeled Age,
Sex, time (i.e., days since symptom onset), Batch, and Donor as random
effects using the following formula:
[MATH: lmerLog_Acc~(1∣Age)
mo>+(1∣Sex)
mo>+(1∣time)+(1∣Batc
h)+(1∣Dono
r),data=mydata,fam<
mi>ily=gaussian(<
/mrow>),REML=T :MATH]
and then we used the relative variances to identify the most
influential factor for each tile.
Modeling zero-inflated associations across covariates
To identify factors associated with high presence of zeroes in scATAC
pseudo bulk data, we ran the following zero-inflated models
[MATH:
glmmTMB<
mfenced
open="(">Log_Ac<
/mi>c~Age+S
mi>ex+time+(1∣PT
mi>ID),zi=~0+CellCounts+Age+
Sex,data=
mo>mydata,family=gaussia
n(),REML=T :MATH]
on all promoter tiles to test whether zeroes were associated with Cell
Counts, Age, and Sex. We extracted the p-values from the glmmTMB and
summarized the associations using the histograms of the p-values.
Statistics and reproducibility
The datasets used in this manuscript were selected for benchmarking
purposes, and therefore there is no need for sample-size calculation,
randomization, or blinding. No samples were excluded from the analyses.
Any computational data cleaning and processing are described
accordingly.
Reporting summary
Further information on research design is available in the [338]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[339]Supplementary Information^ (12.1MB, pdf)
[340]Peer Review File NEW^ (396.8KB, pdf)
[341]Reporting Summary^ (246.6KB, pdf)
Acknowledgements