Abstract

   Bulk ATAC-seq assays have been used to map and profile the chromatin
   accessibility of regulatory elements such as enhancers, promoters, and
   insulators. This has provided great insight into the regulation of gene
   expression in many cell types in a variety of organisms. To date,
   ATAC-seq has most often been used to provide an average evaluation of
   chromatin accessibility in populations of cells. The development of a
   single cell approach (scATAC-seq) assay enables researchers to evaluate
   chromatin accessibility in individual cells and identify sub-groups in
   mixed populations of cells. To investigate the full potential of
   single-cell epigenomic data, we have comprehensively compared the
   information derived from bulk ATAC-seq and scATAC-seq in populations of
   cells. We found that the chromatin architecture signal is the same
   using bulk ATAC-seq and scATAC-seq to analyse aliquots of the same cell
   population. However, scATAC-seq provides substantially higher data
   quality compared to bulk ATAC-seq improving the sensitivity to detect
   relatively weak, but functionally important ATAC-seq signals.
   Furthermore, we found that scATAC-seq identified differences in what
   was previously assumed to be a homogenous population of cells. Finally,
   we determined the number of cells required to generate aggregated open
   chromatin profiles from single cells and to identify biologically
   meaningful clusters after pseudo-bulking of data. This study
   illustrates the added value of using scATAC-seq rather than bulk
   ATAC-seq in evaluating both homogeneous and heterogeneous populations
   of cells. This paper provides a comprehensive guide on the benefits of
   using scATAC-seq data to study gene regulation.

   Keywords: Bulk ATAC-seq, scATAC-seq, Gene regulation, Regulatory
   elements, Chromatin accessibility, Clustering, Pseudo-bulking,
   Single-cell epigenomics

   Subject terms: Computational biology and bioinformatics, Genetics

Introduction

   The assay for transposase accessible chromatin using sequencing
   (ATAC-seq) can be used to profile open chromatin accessibility
   genome-wide^[26]1. The dynamic nature of chromatin accessibility
   reflects and correlates with the activity of genomic regulatory
   elements including enhancers, promoters and insulators. These elements
   account for a major proportion of the active non-coding genome in any
   cell type and contribute to the control of gene activity. They can
   initiate, activate, or suppress gene expression in a cell-type and
   developmental-stage-specific manner. Defining where and when these
   regulatory elements are active is crucial in understanding gene
   regulation and non-coding human genetics^[27]2.

   The ATAC-seq assay has generated maps of regulatory elements in various
   cell types across many organisms. Nonetheless, these samples have often
   consisted of populations of cells, hence ATAC-seq describes the average
   chromatin accessibility of all cells in a sample. To characterize
   cell-type specific regulatory elements, many cell types or stages of
   differentiation in the lineage of these cell types need to be isolated
   prior to performing ATAC sequencing. Generating such collections with
   traditional ATAC-seq therefore requires intensive effort and resources.
   Additionally, isolation of particular cell types is not always
   feasible^[28]3,[29]4.

   The aggregated data from all cells within a sample cannot identify any
   inherent cellular and regulatory heterogeneity of differing cell types
   within the sample. The ability to isolate defined populations of cells
   from their heterogeneous tissue of origin is limited by the
   availability of methods for their identification and their
   isolation^[30]5,[31]6. Recent technological and computational advances
   have made it possible to map chromatin accessibility at a single-cell
   level with single cell ATAC-seq (scATAC-seq), which has allowed for the
   simultaneous analysis of chromatin accessibility landscapes of
   heterogeneous mixtures of cells^[32]7. These outputs can be used to
   computationally cluster cells with similar chromatin accessibility
   profiles within a complex tissue type or a mixture of heterogeneous
   cell populations. This can lead to discovering of new cell populations,
   identifying rare cell types and defining new chromatin accessibility
   landscapes throughput processes such as differentiation and cellular
   activation. Aggregation of the data from these cell clusters
   (pseudo-bulking) can then be used to generate a genome-wide chromatin
   accessibility signal analogous to bulk ATAC-seq from a purified
   population^[33]5,[34]8,[35]9.

   Resolving a heterogeneous population of cells in this way requires much
   less technical effort and cost than would be required with standard
   ATAC-seq. Nevertheless, scATAC-seq is a relatively novel technology and
   several questions need to be answered to understand its full
   capabilities^[36]9,[37]10. First, we have addressed whether the
   chromatin accessibility profiles from bulk ATAC-seq and scATAC-seq
   differ in their structure. In this study, we compare the profiles
   derived from bulk ATAC-seq and scATAC-seq data from late erythroblast
   cells differentiated from CD34 cells from the same individual. We also
   used previously generated bulk ATAC-seq data from Natural Killer cells
   and compared this to publicly available scATAC-seq data from peripheral
   blood mononuclear cells (PBMCs) and see comparable results. Secondly,
   we have established the minimum number of cells required to annotate
   the genome of a heterogeneous cell cluster and have assessed the effect
   of cell number on data quality. This study therefore provides general
   guidance on the application and interpretability of single cell
   ATAC-seq from experiment to data analysis.

Methods

ATAC-seq data analysis for erythroblast and natural killer cells

   Bulk ATAC-seq erythroblast data from a normal human donor was generated
   as part of a collaboration^[38]11. Bulk ATAC-seq data for Natural
   Killer (NK) cells was utilized in this study, previously generated in
   our lab from the same donor as the erythroblast dataset^[39]12. The
   ATAC-seq libraries were prepared and sequenced according to standard
   protocols to ensure high-quality chromatin accessibility profiling.
   Detailed information on the data generation process can be found in
   ^[40]11 and ^12. The raw sequencing reads were aligned to the human
   genome hg19 or hg38 using Bowtie2 (‘-N 1’) (v2.4.2), only properly
   mapped read pairs were retained (42,060,324 reads for erythroblast)
   followed by removing low-quality reads (MAPQ < 30), unmapped reads, and
   PCR duplicates from the aligned data using SAMtools (v1.17). After the
   data quality filter step, 17,586,680 reads remained for erythroblast
   and 38,368,712 reads for NK cells. A coverage track was generated using
   deepTools BamCoverage (‘bamCoverage --binSize 1 –normalizeUsingRPKM’)
   (v3.5.2) to visualise bulk ATAC-seq erythroblast data on the UCSC
   genome browser (Fig. [41]1B). Finally, peak calling was performed via
   Lanceotron (lanceotron callPeaks), called peaks in both datasets were
   filtered by peak score (> 0.5), and blacklisted regions were discarded
   from the peak files.

Fig. 1.

   [42]Fig. 1
   [43]Open in a new tab

   The scATAC-seq erythroblast data retains the same signal architecture
   with much fewer cells than bulk ATAC-seq erythroblast data. (A)
   Sequencing information for bulk ATAC-seq, scATAC-seq and downsampled
   versions of both. In the quality control step, firstly, only properly
   mapped pairs with MAPQ > 30 are extracted. Then, PCR duplicates were
   removed from those reads. (B) Downsampled scATAC-seq erythroblast data
   retains signal structure more accurately than bulk ATAC-seq. Comparison
   of data quality for the erythroblast dataset in scATAC-seq (blue
   track), downsampled scATAC-seq (purple track, downsampled from
   160,957,352 reads to ~ 17,586,680 reads, ~ 11% of scATAC-seq reads),
   bulk ATAC-seq (black track) and downsampled bulk ATAC-seq (orange
   track, downsampled from 80,000 cells to 4,485 cells) at the
   alpha-globin locus. The distribution of DNA fragment sizes for the
   erythroblast dataset in bulk ATAC-seq (black) and scATAC-seq (purple)
   (C) shows a similar distribution trend with the difference in
   sequencing depth. The x-axis of distribution plots represents the
   fragment size of tagmented DNA, the y-axis shows a count frequency for
   those fragments. The decrease at around 150 bp (mono-nucleosomal) and
   300 bp (di-nucleosomal) refers to the nucleosomal positioning. (D) The
   distribution of DNA fragment sizes within peaks (the left panel) and
   outside peaks (the right panel) in bulk ATAC-seq (black) and scATAC-seq
   (purple). (E) 50,000 regions from outside the common peak set were
   randomly selected and expanded by 2 kb on each side. The read count for
   these background regions is shown in blue (bulk ATAC-seq), green
   (scATAC-seq) and red lines (the downsampled scATAC-seq). (F) The
   average chromatin accessibility signal generated in the two assays over
   subclasses of active regions: CTCF, Promoter, Enhancer, Promoter/CTCF,
   Enhancer/CTCF.

scATAC-seq data analysis for erythroblast

10X analysis

   The cellranger-atac count pipeline (10X Genomics Cell Ranger ATAC,
   v1.1.0) was used to analyse scATAC-seq erythroblast data. It aligned
   sequencing reads to the human genome hg19, detected peaks and generated
   a cell-by-peak matrix. 10X pipeline also generated a sorted and indexed
   BAM file containing all reads. The sequencing reads in the BAM files
   were filtered using SAMtools (v1.17) to keep only high-quality reads
   for downstream analysis. Only properly mapped read pairs were retained
   with the following flag ‘-f 3’ (231,097,116 reads) and all unmapped
   reads were filtered out with the following flag ‘-F 4’. Reads with
   mapping quality (MAPQ) score of 30 or higher were retained and
   duplicates were removed (samtools rmdup) (160,957,352 reads). For
   visualisation purposes, the coverage tracks were generated using
   deepTools (v3.5.2) with the following parameters: ‘bamCoverage
   --binSize 1 –normalizeUsingRPKM’. Peak calling was then performed on
   RPKM normalized bigwig file via LanceOtron (lanceotron
   callPeaks)^[44]13. Called peaks were then filtered by removing
   blacklist regions and retained peaks with an overall peak score greater
   than 0.5. Finally, data were visualised on the UCSC genome
   browser^[45]14 (Fig. [46]1B).

cisTopic analysis

   After running the 10X count pipeline on scATAC-seq the erythroblast
   data, the cisTopic tool was used for downstream analysis. cisTopic
   leverages topic modelling, particularly Latent Dirichlet Allocation
   (LDA), to identify co-accessible chromatin regions (topics) across
   cells, which is more biologically meaningful than clustering cells or
   peaks separately^[47]15. The cisTopic object was created using the
   cell-by-peak matrix from the 10X results, followed by running the
   runWarpLDAModels function with the following parameters:
     * α = 50,
     * β = 0.1,
     * iterations = 500,
     * a number of topics = x, where x is in the range of 2 and 100 (2,
       10, from 20 to 60, 1 by 1; from 70 to 100, 10 by 10).

   With the selectModel function, the number of topics was determined as
   48 after assessing the second derivative of the likelihood curve and
   perplexity. Then, for visualisation purposes, 2D t-Distributed
   Stochastic Neighbour Embedding (t-SNE) were calculated using the Rtsne
   R package (v0.17) (Fig. [48]3A).

Fig. 3.

   [49]Fig. 3
   [50]Open in a new tab

   scATAC-seq can detect differences in chromatin accessibility in a
   presumed homogenous population. (A) Dimensionality reduction analysis
   for scATAC-seq of a population of erythroblasts is shown as a t-SNE
   projection. Each dot represents a single cell. Cells within the dotted
   purple circle demonstrate a different epigenomic landscape from other
   cells. (B) Genomic location-based peak annotation for the large cluster
   (top) and the small cluster (bottom). Peaks in the small cluster are
   identified as very close promoters, whereas the activity of distal
   regulatory elements seems to diminish. (C) Gene ontology enrichment
   result of the small cluster highlights regulation of the processes
   known to be involved in the ejection of the nucleus during red blood
   cell development. GO enrichment results of biological process (BP) in
   the small cluster (pink dots) and the large cluster (green dots) are
   shown as a dot plot. The x-axis shows the gene count ratio as a
   percentage, whereas the y-axis refers to enriched biological processes
   that are found by the algorithm after the false discovery rate (FDR)
   control.

Gene ontology (GO) enrichment analysis

   After performing cisTopic analysis on scATAC-seq erythroblast data, we
   detected two main clusters (Fig. [51]3A). To further understand the
   difference between these clusters, gene ontology (GO) enrichment
   analysis was performed on both peak sets (the small and large clusters
   in Fig. [52]3A). For GO enrichment analysis, enrichGO function from the
   clusterProfiler R package (v4.8.1) was used with default parameters for
   three orthogonal ontologies (biological process (BP), cellular
   component (CC) and molecular function (MF))^[53]16. For pathway
   enrichment analysis, enrichKEGG function from the same R package was
   used with default parameters. All enrichment results for two peak sets
   were merged in R using the tidyverse package (v2.0.0).

Peripheral blood mononuclear cells (PBMCs) data analysis

10X analysis

   To reduce computation running time, a small PBMC multiome dataset
   containing approximately 3,000 cells (data:
   [54]https://www.10xgenomics.com/datasets/pbmc-from-a-healthy-donor-gran
   ulocytes-removed-through-cell-sorting-3-k-1-standard-2-0-0) was used to
   demonstrate the required number of cells to form robust clusters.
   Therefore, the ATAC fragment information file and filtered feature
   barcode matrix, analysed by Cell Ranger ARC (10X Genomics, v2.0.0),
   were downloaded from the 10X Genomic website^[55]17.

ArchR analysis

   After downloading the fragments file from the 10X Genomics for
   scATAC-seq PBMC, the ArchR package (v1.0.2) was used to form
   clusters^[56]18. Firstly, an arrow file was created with the following
   parameters:
     * filterTSS = 4,
     * filterFrags = 1000,
     * addTileMat = TRUE,
     * addGeneScoreMat = TRUE.

   Then, the ArchR project was created after calculating doublet scores
   and filtering out doublets based on these scores (using the
   addDoubletScores function with default parameters). The workflow was
   explained below with default parameters used for each step.
    1. The iterative LSI dimensionality reduction was implemented using
       the addIterativeLSI function on the tile matrix.
    2. Uniform Manifold Approximation and Projection (UMAP) was added via
       the addUMAP function.
    3. The plotEmbedding function was then used to visualise data in 2D
       space and cells were coloured by cluster information.
    4. Finally, the addClusters function was used to form clusters
       (Fig. [57]5).

Fig. 5.

   [58]Fig. 5
   [59]Open in a new tab

   In silico experiment shows that clustering is still possible even with
   40 cells. Each UMAP displays clustering results on different cell
   numbers for the erythroblast data as UMAP projections using ArchR. Each
   dot represents an individual cell and is coloured by cell type, as seen
   in the legend. (A) 300 cells; (B) 150 cells; (C) 80 cells; (D) 40
   cells; and (E) 20 cells. The MLV session can be accessed through the
   link. [60]https://mlv.molbiol.ox.ac.uk/projects/sc_atac_seq/5970.

Natural Killer (NK) cells from PBMC data analysis

   Additionally, we utilized previously generated data with bulk ATAC-seq
   NK cells in the lab to make a robust comparison between the two
   methods. Since we did not have matching scATAC-seq NK cells from the
   same donor, we used publicly available 10X Genomics PBMC datasets
   (Table [61]1). To reach a similar number of cells as we had in
   scATAC-seq erythroblast (4,485 cells) data, we merged five different
   PBMC datasets. For those five datasets, their “filtered_peak_bc_matrix”
   files in h5 format were downloaded. These matrices were uploaded into
   the Azimuth ATAC app
   ([62]https://app.azimuth.hubmapconsortium.org/app/human-pbmc-atac) for
   preprocessing and annotation analysis^[63]19.

Table 1.

   A list of five different 10X Genomics PBMC datasets and their sources.
   10X Datasets                          Source
   10k PBMC using Next Gem
   [64]https://www.10xgenomics.com/datasets/10-k-peripheral-blood-mononucl
   ear-cells-pbm-cs-from-a-healthy-donor-next-gem-v-1-1-1-1-standard-2-0-0
   10k PBMC using Chromium Controller
   [65]https://www.10xgenomics.com/datasets/10-k-human-pbm-cs-atac-v-1-1-c
   hromium-controller-1-1-standard-2-0-0
   10k PBMC using Chromium X
   [66]https://www.10xgenomics.com/datasets/10-k-human-pbm-cs-atac-v-1-1-c
   hromium-x-1-1-standard-2-0-0
   10k PBMC using Chromium Controller v2
   [67]https://www.10xgenomics.com/datasets/10k-human-pbmcs-atac-v2-chromi
   um-controller-2-standard
   10k PBMC using Chromium X v2
   [68]https://www.10xgenomics.com/datasets/10k-human-pbmcs-atac-v2-chromi
   um-x-2-standard
   [69]Open in a new tab

   Cell barcode (CB) tag, which contains a unique identifier for each
   cell, related to NK cells (4,119 cells) were extracted from Azimuth
   ATAC annotation results (Supplementary Fig. 4A). 10X PBMC BAM files
   were first filtered as described for erythroblast scATAC-seq data
   analysis. And then further subsetted to extract NK cells using a
   modified version of the online script
   ([70]https://divingintogeneticsandgenomics.com/post/split-a-10xscatac-b
   am-file-by-cluster/). The coverage tracks were generated using
   deepTools (v3.5.2) as explained before. Finally, peak calling was
   performed using LanceOtron (lanceotron callPeaks)^[71]13, followed by
   filtering called peaks by peak score (> 0.5), and removing blacklisted
   regions from the peak file.

Peak annotation

Genomic coordinates-based peak annotation

   The filtered peak files from bulk ATAC-seq and scATAC-seq were imported
   into R separately using the tidyverse R package (v2.0.0) as data frames
   and converted into GRanges objects with using the GenomicRanges R
   package (v1.52.1). annotatePeak function from the ChIPseeker R package
   (v1.36.0) was used to annotate peaks based on the UCSC genome
   annotation. The results from annotatePeak function were visualised
   using the ggplot2 R package (v3.5.1) with the following parameters:
   geom_bar(stat = “identity”) (Figs. [72]2A and [73]3B and Supplementary
   Fig. [74]4B).

Fig. 2.

   Fig. 2
   [75]Open in a new tab

   scATAC-seq identifies many more open chromatin regions than bulk
   ATAC-seq particularly, CTCF-bound-regions and potential
   primed-distal-elements. (A) Genomic location-based peak annotation for
   bulk ATAC-seq (left) and scATAC-seq peaks (right). scATAC-seq peaks
   provide a richer source of information on distal regulatory elements
   when compared to bulk ATAC-seq peaks. (B) The difference between bulk
   ATAC-seq and scATAC-seq peaks is shown as a Venn diagram representing
   the total number of peaks in each sample. (C) Peaks in bulk ATAC-seq
   were centred and expanded 2 kb from each side. Read count for those
   fixed regions was acquired from H3K4me1, H3K4me3, H3K27ac and CTCF.
   Then, peaks are categorised based on their epigenetic signatures. The
   peak annotation results for bulk ATAC-seq peaks by using publicly
   available ChIP-seq datasets show bulk ATAC-seq poorly identifies weak
   open chromatin regions such as CTCF sites. (D) Peaks in scATAC-seq were
   centred and expanded as 2 kb from each side. Read count for those fixed
   regions was acquired from, respectively, H3K4me1, H3K4me3, H3K27ac and
   CTCF. Then, peaks are categorised based on their on their epigenetic
   signatures. The peak annotation results for scATAC-seq peaks by using
   in-house ChIP-seq datasets show that scATAC-seq is more sensitive in
   the detection of CTCF sites. (E) The first bar plot represents the
   total number of classical CTCF peaks, the rest shows the total number
   of overlapping peaks between CTCF and open chromatin data (purple:
   scATAC-seq, blue: downsampled scATAC-seq (~ 11% of scATAC-seq reads)
   and red: bulk ATAC-seq). Related MLV^[76]25 session can be accessed
   through the link,
   [77]https://mlv.molbiol.ox.ac.uk/projects/multi_locus_view/6274.

Peak annotation using publicly available ChIP-seq marks

   Another way of annotating peaks is using specific ChIP-seq marks, such
   as H3K4me1, H3K4me3, and H3K27ac, to detect primed enhancer, promoter
   and general transcription activity^[78]20. The publicly available
   ChIP-seq datasets of these histone modifications, CTCF, and background
   input were used for this purpose^[79]21. For visualisation in the UCSC
   Genome Browser, coverage tracks were generated using deepTools
   BamCoverage (v3.5.2) for ChIP-seq BAM files. For both peak files, all
   peaks were extended by 2 kb from the center of the peak in each
   direction (in Python) to capture the ChIP-seq signal correctly.
   Coverage information was acquired from ChIP-seq BAM files for these
   extended peaks in both peak sets via bedtools multicov. Once read
   counts were obtained, the extended peak files were sorted by the
   difference between H3K4me1 and H3K4me3. For plotting, computeMatrix
   reference-point from deepTools (v3.5.2) was used to create matrices for
   bulk ATAC-seq and scATAC-seq data between ChIP-seq marks and peak files
   using the following parameters:
     * --referencePoint = center,
     * --beforeRegionStartLength = 2000,
     * --afterRegionStartLength = 2000,
     * --sortRegions = keep,
     * --skipZeros,
     * --numberOfProcessors = max.

   Once matrices were generated, plotHeatmap from deepTools (v3.5.2) was
   performed to create a heatmap of each peak in each peak file as well as
   a line plot for average profile. Heatmaps and line plots were modified
   in Python using pandas (v2.1.4), NumPy (v1.24.3), matplotlib (v3.7.1)
   and seaborn (v0.12.2) packages. Then, peaks were categorised based on
   the read coverage from ChIP-seq data (Figs. [80]1E-F, [81]2C-D and
   [82]4C)^[83]20.

Fig. 4.

   Fig. 4
   [84]Open in a new tab

   Data quality in scATAC-seq erythroblast data is high, with even 500
   cells. (A) Comparison of data quality for scATAC-seq erythroblast data
   to cell number at the alpha-globin locus. The black track, 4,485 cells;
   the blue track, 2,000 cells; the orange track, 1,000 cells; the purple
   track, 500 cells; the yellow track, 100 cells; the grey track, 50
   cells. (B) The total number of overlapping peaks between scATAC-seq
   erythroblast and its downsampled versions. (C) Peak annotation results
   for peaks from scATAC-seq erythroblast downsampled data containing 500
   cells using the ChIP-seq datasets shows that as the number of cells
   reduces, the sensitivity of scATAC-seq data to identify CTCF sites
   diminished. Peaks in scATAC-seq data, containing 500 cells, were
   centred and expanded as 2 kb from each side. Read count for those fixed
   regions was acquired from H3K4me1, H3K4me3, H3K27ac and CTCF. Then,
   peaks are categorised based on their read coverage values. (D) The
   first bar plot represents the total number of classical CTCF peaks that
   were detected from ChIP-seq CTCF from the same individual, the
   subsequent bars show the total number of overlapping peaks between CTCF
   and scATAC-seq erythroblast downsampled versions.

   For supplementary Fig. [85]2, only peaks that are unique to scATAC-seq
   data were retained (57,586 peaks in Fig. [86]2B) and performed
   genome-wide heatmap using the same procedure explained before for the
   peak annotation using publicly available ChIP-seq marks (Supplementary
   Fig. [87]2).

   For the NK cell population, publicly available ChIP-seq datasets
   (H3K4me1, H3K4me3 and CTCF) were downloaded from the ENCODE
   database^[88]22–[89]24. The same analysis steps performed for the
   erythroblast data were also applied here (Supplementary Fig. [90]4C).

In silico downsampling experiment

Homogenous data, erythroblast

   scATAC-seq erythroblast data, which contains 4,485 cells, was
   downsampled to 2,000, 1,000, 500, 100, and 50 cells using a custom
   pipeline with Python packages, including pandas (v2.1.4), pysam
   (v0.21.0), tqdm (v4.65.0), os (v3.11.3), matplotlib (v3.7.1) and
   seaborn (v0.12.2).

   The pipeline includes the following steps:
     * Extracting CB tags from the analysis folder of the 10X count
       pipeline.
     * Plotting read distribution for the tags.
     * Downsampling the total number of cells to 2,000, 1,000, 500, 100
       and 50 cells by randomly selecting CB tags in turn.
     * Plotting read distribution for randomly selected tags.
     * Subsetting the data by the total number of CB tags, and creating
       new BAM files.
     * Sorting and indexing new BAM files.

   After generating downsampled BAM files, the coverage tracks were
   generated using deepTools BamCoverage (v3.5.2) and uploaded onto the
   UCSC Genome Browser (Fig. [91]4A). Finally, peak calling was performed
   using LanceOtron (lanceotron callPeaks)^[92]13, followed by filtering
   called peaks by peak score (> 0.5), and removing blacklisted regions
   from the peak file.

ChIP-seq peak intersection with various downsampled scATAC-seq peaks

   scATAC-seq erythroblast data, which contains 4,485 cells, was
   additionally downsampled to 400, 300, and 200 cells using the same
   script explained previously for in silico downsampling experiment to
   include more data. Finally, peak calling was performed using LanceOtron
   (lanceotron callPeaks) on both different versions of downsampled
   scATAC-seq sets (4485, 2000, 1000, 500, 400, 300, 200, 100 cells) and
   ChIP-seq promoter, enhancer and CTCF data^[93]13. bedtools intersect
   with the ‘-loj’ parameter was used to find overlapping regions between
   these scATAC-seq datasets and ChIP-seq datasets. This command outputs
   each peak in the ChIP-seq datasets along with overlapping regions from
   the scATAC-seq datasets. If there is no overlap, the output file keeps
   non-overlapping peak regions by filling them with placeholder values.
   The bar plots showing the number of overlaps between ChIP-seq datasets
   and scATAC-seq datasets were generated using the tidyverse R package
   (v2.0.0) (Fig. [94]4D, Supplementary Fig. [95]8A and B).

In silico dilution experiment

   In silico experiment integrated two filtered scATAC-seq datasets: PBMC
   and erythroblast. Analysis of scATAC-seq erythroblasts showed that
   data, indeed, consists of two clusters: erythroblast and late
   erythroblast (Fig. [96]3). The large erythroblast cluster was extracted
   using CB tags in Python. Selected clusters and their total number of
   cells are as follows:
     * Monocytes cluster: 842 cells,
     * CD4 T cluster: 758 cells,
     * CD8 naïve cluster: 288 cells,
     * B cells cluster: 198 cells,
     * CD8-other cluster: 129 cells,
     * NK cells cluster: 106 cells,
     * pDC cells cluster: 29 cells.

   A custom-made pipeline was used with the steps outlined as follows:
    1. Extracting CB tags for homogenous erythroblast data (Fig. [97]5)
       and heterogenous PBMC data (B cells: Supplementary Fig. [98]9 and
       CD8 naïve cells: Supplementary Fig. [99]10).
    2. Plotting read distribution for those tags.
    3. Selecting CB tags randomly from the list in point 1, so the total
       number of diluted cells is 300 (198 for B and CD8 naïve cells),
       150, 80, 40, and 20 cells.
    4. Plotting read distribution for the randomly selected tags.
    5. Subsetting the data by the total number of CB tags, and creating
       new BAM files.
    6. Sorting and indexing new BAM files.
    7. Merging subsetted homogenous erythroblast data with curated PBMC
       data.
    8. Indexing merged BAM file.
    9. Performing ArchR analysis on the merged data.
   10. Merging different results from ArchR into one metadata and
       colouring UMAP by cell type and cluster ID (Fig. [100]5).
   11. Saving metadata into a file in text format.

Assigning cell identity to clusters using multiome snATAC-seq and snRNA-seq

   Azimuth app ([101]https://azimuth.hubmapconsortium.org/) was used to
   analyse the gene expression matrix of snRNA of multiome dataset to get
   cell type annotation^[102]19. Azimuth performs cell identity assignment
   to clusters by leveraging the reference-based mapping method after
   normalising the count matrix. Annotation result provides three
   different levels of annotation depending on their granularity. Results
   from three different annotation methods were integrated together in R
   using the tidyverse R package (v2.0.0) and visualised in Multi-Locus
   View (MLV)^[103]25 (Fig. [104]5).

Graphical analysis

Genome-wide chromatin accessibility patterns in bulk ATAC-seq and scATAC-seq
erythroblast data

   Only common peaks between bulk ATAC-seq and scATAC-seq data were
   retained (13,680 peaks in Fig. [105]2B) for this analysis. The common
   peak set was sorted by chromatin accessibility signal from scATAC-seq
   in descending order and this order was kept when generating the heatmap
   as explained previously for peak annotation using publicly available
   ChIP-seq marks (Supplementary Fig. [106]1A).

ATAC DNA fragment distribution

   First of all, the distributions of DNA fragments for both bulk ATAC-seq
   and scATAC-seq data was plotted using all reads on chr16 with the R
   packages rtracklayer (v1.60.1) and Rsamtools (v2.16.0). The ggplot2 R
   package (v3.5.1) was used to plot the distribution using geom_line
   (Fig. [107]1C). Secondly, reads overlapping with common peak regions
   (13,680 peaks in Fig. [108]2B) were extracted from both bulk ATAC-seq
   and scATAC-seq data using SAMtools view (v1.17) with the ‘-L’
   parameter. Next, these reads were excluded from both data to represent
   non-overlapping reads with the common peak file using bedtools
   intersect (v2.29.2) with the ‘-v’ parameter. The distributions of DNA
   fragments within and outside peak regions in both assays were plotted
   as explained before (Fig. [109]1D).

Background estimation plot

   Bedtools genomecov (v2.29.2) with the parameter ‘-bga’ is used to
   compute genome coverage by counting the number of reads that cover each
   base for bulk ATAC-seq and scATAC-seq erythroblast data. The base
   coverage files for bulk ATAC-seq and scATAC-seq are then used to
   calculate the percentage of total reads at each coverage level relative
   to the total reads in each data and visualise the results as
   percentages in the scatter plot (Supplementary Fig. [110]1B).

Chromatin signal pattern outside the peak regions in erythroblast datasets

   4 kb long 50,000 genomic regions were randomly selected from outside
   the union peak sets of bulk ATAC-seq and scATAC-seq. computeMatrix
   reference-point from deepTools (v3.5.2) was used to generate a signal
   matrix across the specified regions for comparison of signal patterns
   in the bulk ATAC, the scATAC-seq and the downsampled scATAC-seq data
   sets. Then, the matrix was imported into Python for visualization using
   pandas (v2.1.4), NumPy (v1.24.3), matplotlib (v3.7.1) and seaborn
   (v0.12.2) packages. The comparative line plots show chromatin signal
   across genomic positions (4 kb long 50,000 genomic regions outside the
   union peak sets) for three datasets (Fig. [111]1E).

Results

scATAC-seq improves data sensitivity compared to bulk ATAC-seq data

   Although scATAC-seq data provides chromatin accessibility profiles at a
   single cell level, it is not clear whether it fully recapitulates the
   same signal structure as seen using bulk ATAC-seq data. In a previous
   study^[112]11, we generated data with both bulk ATAC-seq and scATAC-seq
   using the same input material allowing us to compare the two methods.
   Bulk ATAC and scATAC sequencing were performed on late erythroblasts
   derived from ex vivo differentiated CD34 + haematopoietic stem cells
   from a normal human donor. The input cell number for bulk ATAC-seq was
   around 80,000 cells, while for scATAC-seq was around 7,000 nuclei
   (4,485 cells recovered) and was sequenced to the recommended depth per
   cell for the 10X protocol (Fig. [113]1A)^[114]11. As the population was
   considered to be essentially homogeneous, the aggregate of all the
   single-cell data was used to compare bulk ATAC-seq and scATAC-seq from
   the same cells as genome-wide tracks of enrichment.

   Visual assessment of the aggregated scATAC-seq data, at specific
   well-characterised active loci, such as the alpha-globin locus, showed
   very similar open chromatin patterns to the bulk ATAC-seq, although
   both data quality and depth appeared superior in scATAC-seq especially
   when the bulk signal was downsampled for number of input cells
   (Fig. [115]1A and B). Also, a genome-wide analysis of the common peaks
   called in both the bulk ATAC-seq and scATAC-seq showed them to be
   quantitatively similar in their intrinsic peak strength. Therefore,
   sorting by the signal strength of scATAC-seq from high to low, also
   sorts the bulk peaks from high to low signal (Supplementary
   Fig. [116]1A). These results showed that technology difference did not
   change the overall signal architecture, pattern of open chromatin sites
   or even relative signal strength of the peaks with the largest peaks in
   scATAC-seq also being the largest peaks in the bulk (Supplementary
   Fig. [117]1A). This, therefore, showed that aggregated scATAC-seq data
   can be used in an analogous manner to bulk ATAC-seq, but also suggested
   that scATAC-seq data from a substantial smaller number of cells could
   generate equivalent or even superior data to bulk ATAC-seq from a much
   larger number of cells.

   Next, we tested if differences in the products of the transposition
   reaction in each technology might be different and so explain the
   differences in performance. The fragment sizes generated by each
   technology both show the same evident di-nucleosomal, mono-nucleosomal
   and sub-nucleosomal fragment distribution pattern in both bulk ATAC-seq
   (black line) and scATAC-seq (purple line) (Fig. [118]1C). While this
   shows that the Tn5 tagments chromatin into nucleosomal and
   subnucleosomal sized fragments in both technologies, an enrichment for
   both sub-nucleosomal and mono-nucleosomal fragments was evident within
   peak regions in scATAC-seq (Fig. [119]1D). In contrast, outside of peak
   regions in scATAC-seq was seen to be depleted for sub-nucleosomal but
   enriched for mono-nucleosomal fragments (Fig. [120]1D). Consequently,
   the background signal in bulk ATAC-seq is enriched for sub-nucleosomal
   fragments and depleted for nucleosomal fragments, relative to
   scATAC-seq. One potential reason for the apparent increase in signal in
   scATAC-seq could be that the bulk ATAC-seq protocol generates higher
   background due to its different tagmentation behaviour, which must be
   sequenced through to detect the signal. We, therefore, randomly
   selected 50,000 genomic regions (4 kb long) from outside the peak set
   and piled up the bulk, the full scATAC-seq and scATAC-seq signals
   subsampled to the same read depth as the bulk, over these inactive
   regions (Fig. [121]1E). While the background in the highly sequenced
   scATAC-seq dataset (green line) obviously appears much higher than in
   the bulk ATAC-seq (blue line), when the scATAC-seq is downsampled to
   the same sequence depth as the bulk sample (red line) levels of
   background appear comparable. Therefore, it appears that the increased
   sensitivity of scATAC-seq is not due to decreased levels of background
   generated in the scATAC-seq protocol. We next assessed the relative
   levels of signal generated in the two assays over active regions
   subclassified by types of active regions (‘CTCF’, ‘Promoter’,
   ‘Enhancer’, ‘Promoter/CTCF’, ‘Enhancer/CTCF’) (Fig. [122]1F). To
   perform this analysis additional erythroblast ChIP-seq datasets were
   used to determine regulatory element classes: H3K4me1, primed enhancer
   marker; H3K4me3, promoter marker; H3K27ac, active chromatin marker; and
   CTCF, boundary elements. With these data, both bulk ATAC-seq and
   scATAC-seq open chromatin peaks were annotated as to whether they were
   likely enhancers, promoters or CTCF sites. In each peak category, the
   downsampled scATAC-seq showed increased signal over each class of
   active element which was most evident over enhancers elements and CTCF
   bound regions. Using the scATAC-seq data at the recommended sequence
   depth (green line in Fig. [123]1F) showed an extremely large increase
   in signal-to-background. Considering this data was generated from only
   4,485 cells (compared to 80,000 for bulk ATAC-seq), it showed that the
   scATAC-seq protocol generates extremely complex and information-rich
   ATAC libraries (Supplementary Fig. [124]1B). These high levels of
   library complexity, combined with increased signal over activity
   elements, therefore likely leads to the high levels of sensitivity per
   cell seen in scATAC-seq relative to bulk ATAC-seq.

   This increased signal-to-noise within the scATAC-seq has the potential
   to increase the sensitivity of the scATAC-seq for the detection of open
   chromatin sites. Therefore, we compared the peak called output of the
   two data types and the number and types of active regions detected were
   compared. To do this, we applied the Lanceotron^[125]13,[126]26 peak
   calling method to both data sets (total aggregated scATAC-seq and bulk
   ATAC-seq without downsampling). Both bulk ATAC-seq and scATAC-seq peaks
   were annotated based on genomic locations to determine if certain
   classes of elements, including promoters and distal regulatory
   elements, were detected differentially by the two technologies
   (Fig. [127]2). The most common class detected in the bulk ATAC-seq
   peaks are promoters (with a transcription start site (TSS) distance is
   less than 1 kb), while in scATAC-seq there is a large increase in
   distal intergenic and intronic peaks (Fig. [128]2A). Secondly, from a
   total of 71,313 peaks called, 57,586 peaks are only detected by
   scATAC-seq data, 47 peaks are unique to bulk ATAC-seq data and 13,680
   overlap between both data sets (Fig. [129]2B). Therefore, almost every
   bulk ATAC-seq peak is also found in the scATAC-seq set. This supports
   the idea that single-cell ATAC-seq may simply be more sensitive than
   bulk ATAC-seq on a per cell basis, while the enrichment for promoter
   distal peaks would suggest that it is more effective in detecting
   enhancer and/or CTCF sites than bulk ATAC-seq.

   Using our ChIP-seq based annotation showed that only a very small
   proportion of called peaks in bulk ATAC-seq are classified as CTCF
   sites, with the majority of peaks epigenetically marked as either
   enhancers or promoter classes (Fig. [130]2C). However, the same
   analysis in scATAC-seq peaks shows a very large increase in the
   detection of CTCF sites and, to a lesser degree, weakly marked active
   enhancer-like sites, suggestive of primed enhancers (Fig. [131]2D).
   This shows that bulk ATAC-seq has a much lower sensitivity than
   scATAC-seq for the detection of weak open chromatin sites, such as CTCF
   sites and weaker active enhancer-like elements (Supplementary
   Fig. [132]2). A very small number (N = 47) of enhancer and promoter
   peaks are unique to bulk ATAC-seq data. Manual inspection of these
   peaks showed them mostly to be due to minor variations in the peak
   callers’ behaviour (Supplementary Fig. [133]3) on the two datasets.

   To quantify more objectively the difference in sensitivity between bulk
   ATAC-seq and scATAC-seq, we used annotated CTCF bound regions derived
   from the ChIP-seq analysis for these cells, devoid of overlapping
   enhancer or promoter chromatin marks. These sites represent the
   classical CTCF-bound sites that are not associated with chromatin marks
   and are typically associated with weak open chromatin peaks, making
   them more challenging to detect in ATAC-seq data. Of the 22,072 such
   classical CTCF sites, 17,387 (78.77%) were detectable by calling peaks
   from the aggregated scATAC-seq signal from all 4,485 cells. In
   comparison, only 1,054 (4.77%) of these sites were detectable when peak
   calling the bulk ATAC-seq sample derived from ~ 80,000 cells. This
   increased sensitivity is still seen after subsampling the scATAC-seq
   data to a comparable read depth with the bulk data, which detected
   5,029 of the 22,072 CTCF sites (22.78%) (Fig. [134]2E), in agreement
   with the increased signal seen over CTCF sites in scATAC-seq
   (Fig. [135]1F).

   In summary, the analysis of matched bulk ATAC-seq and scATAC-seq showed
   that, after pseudo-bulking, scATAC-seq data contained similar data but
   of superior sensitivity per cell of input than bulk ATAC-seq. This is
   important as it means it can therefore be used similarly to bulk
   ATAC-seq to annotate regulatory elements but at greater sensitivity at
   much lower cell numbers. This, therefore, opens the potential to use
   scATAC-seq data to annotate the epigenetic landscapes of multiple cell
   types simultaneously at high sensitivity in heterogeneous cell
   populations.

   To confirm the generality of this conclusion, we made use of the
   publicly available scATAC-seq dataset from peripheral blood mononuclear
   cells and utilized previously generated bulk ATAC-seq data from Natural
   Killer cells (NK), which have a comparable number of cells to the large
   cluster in our erythroid dataset after merging (Supplementary
   Fig. [136]4A). This shows that a cluster of approximately four thousand
   cells produces a larger number of peak calls (58,753 vs. 39,778) for
   scATAC-seq compared to bulk ATAC-seq generated from 80,000 cells, with
   increased detection of CTCF bound sites (Supplementary Fig. [137]4B and
   C). However, due to the reuse of public data the bulk ATAC-seq and
   scATAC-seq are from different donors which likely explains the
   decreased overlap in the common peaks (39,778 peaks in total bulk
   ATAC-seq with 38,465 common between bulk ATAC-seq and scATAC-seq).

scATAC-seq can detect differences in a presumed homogenous population

   Identifying clusters of similar cells within complex heterogeneous
   populations is an essential but computationally intricate step in
   scATAC-seq analysis. Sparse data from each cell can be clustered to
   reveal distinguishable cell types and their epigenetic profiles. Bulk
   ATAC-seq of purified primary cells often assumes that they represent a
   homogenous population. scATAC-seq can detect differences in such
   “homogenous” populations. This can be seen in the dimensionality
   reduction and clustering analysis performed by Truch et al.^[138]11.
   The cells derived from the ex vivo differentiation of CD34+ stem cells
   have been considered a homogenous population of late erythroblasts,
   where the only expected differences would be related to cells in
   different stages of the cell cycle. However, the data analysis for this
   cultured sample clearly showed two distinct clusters of cells
   (Fig. [139]3A). To determine the reason for this, these two clusters
   were analysed further as individual cell types.

   After applying peak calling on each cluster, peak sets were annotated
   based on where those peaks lie within the genome. This analysis showed
   a profound loss of peaks lying distal to TSSs in the smaller cluster
   (Fig. [140]3B). Gene ontology (GO) and pathway enrichment analysis
   using Kyoto Encyclopedia of Genes and Genomes (KEGG)^[141]16 was
   performed to understand the biological reason why the distribution of
   peaks in the small cluster might have changed drastically (Fig. [142]3C
   and Supplementary Fig. [143]5). The analysis showed a strong
   up-regulation of processes such as autophagy, macroautophagy,
   catabolism and chromatin modification (Fig. [144]3C and for full
   pathway analysis see Supplementary Fig. [145]5). These are all key
   processes in pyknosis that precedes the expulsion of the nucleus and
   mitochondria during red cell maturation^[146]27. Additionally, open
   chromatin profiles have previously been generated for the preceding
   stages of erythroid differentiation which would be the only other
   likely contaminating population in erythroid differentiation, and these
   do not match the profile of the small cluster even though it retains
   erythroid specific signals such as the enhancer regions of the alpha
   globin cluster (Supplementary Figs. 6 and 7). Therefore, this strongly
   suggests that this cluster represents cells derived from the larger
   cluster and represents a later erythroblast stage of differentiation,
   immediately prior to enucleation^[147]28.

   This analysis showed that scATAC-seq not only recapitulates the same
   signal architecture as bulk ATAC-seq but also reveals cellular
   heterogeneity among and within presumed homogenous populations and
   allows for their separate analysis after pseudo-bulking of the detected
   clusters. The obvious difference in cell number between the large and
   small clusters highlights another important question in scATAC-seq: how
   many cells are enough to generate interpretable data equivalent to
   current bulk data?

How many cells are required for an accurate genome annotation after
pseudo-bulking?

   To evaluate how many cells are required to obtain meaningful chromatin
   signatures after pseudo-bulking the scATAC-seq data (late erythroblast)
   was downsampled from 4,485 cells to 2000, 1000, 500, 100 and 50 cells.
   A visual assessment shows that data quality is relatively robust, down
   to 500 cells (Fig. [148]4A). At 100 cells, although the active regions
   are clearly discernible, the structure of the data is less well
   defined. While the extraction of useful data is still possible, it is
   more computationally challenging, and large losses in sensitivity are
   likely. To assess this globally, Fig. [149]4B shows the total number of
   peaks in each downsampled dataset and the number of peaks that overlap
   the original peaks obtained from analysing 4,485 cells (Fig. [150]4B).
   As the sensitivity diminishes, the number of overlaps between the
   downsampled data and the unsampled data clearly decreases. As it was
   seen that scATAC-seq and bulk ATAC-seq had differential sensitivity for
   detecting different classes of genomics elements, particularly CTCF
   sites, we assessed which classes were most sensitive to decreasing cell
   numbers.

   Peaks from scATAC-seq erythroblast downsampled data containing 500
   cells were categorised as a promoter, enhancer or CTCF using the
   ChIP-seq data to observe what type of regulatory element peaks were
   lost when sensitivity diminished. When the peak annotation results in
   Fig. [151]4C are compared with Fig. [152]2D, the majority of the peaks
   lost in the downsampled peak sets are CTCF sites. Thus, by reducing
   cell number and therefore, sensitivity of scATAC-seq data, regulatory
   elements like CTCF sites with weaker open chromatin signatures are more
   likely to be lost.

   It would therefore be of great use for the practical application of
   scATAC-seq data to know approximately at what number of cells would the
   quality be equivalent to standard bulk ATAC-seq. Therefore, we again
   used CTCF-bound elements as an internal reference and progressively
   subsampled the scATAC-seq data until it was approximately as sensitive
   as the bulk data to the detection of these weak elements. We found that
   the aggregated scATAC-seq data from 200 cells detected a very analogous
   number of CTCF peaks to the bulk data (1,469 vs. 1,054 Fig. [153]4D).
   We also performed the same analysis for peaks annotated by chromatin
   marks as either enhancer and promoters and also showed approximately
   200 cells gave a similar sensitivity to the bulk data (Supplementary
   Fig. 8). This suggests by these metrics that a cluster of around 200
   cells in scATAC-seq would be of comparable sensitivity to standard bulk
   ATAC-seq.

How many cells are required to form clusters robustly?

   In the previous analysis, downsampling the total aggregated data was
   used to analyse the effect of cell number on data quality. However, in
   practice, we would analyse data from specific clusters identified
   within the data. It is therefore important to know not only that a
   cluster of a given number of cells will yield usable data, but also
   that a cluster of this size can be robustly identified during the
   dimensionality reduction and clustering steps to allow its subsequent
   aggregation and analysis. Therefore, the effect of the total number of
   cells on the formation of distinct clusters within the data was
   investigated to observe what diminished cell count accurate clustering
   would be impacted.

   To do this, we chose to spike in data from cells of a highly defined
   cell population erythroblasts) into data from a highly heterogeneous
   population (PBMCs N = 3,000 cells) at different levels. Both data sets
   were derived from the 10X platform, and erythroblasts do not tend to
   exist in PBMC isolates. The aim was to determine the lowest number of
   erythroblast cells still capable of forming a distinct and identifiable
   cluster in the context of a highly complex sample. We chose to extract
   cell data from the large cluster shown in Fig. [154]3A to represent a
   homogenous erythroblast population and then spiked in 300, 150, 80, 40
   or 20 cells into the PBMC data. Each spike-in dataset then went through
   the same analytical process with the same parameters using
   ArchR^[155]18. In this analysis, the spiked in erythroid cells are
   easily identified from the PBMC by tracking their unique cell IDs (CB
   tags). Clusters in the PBMCs were annotated using the gene expression
   data component of the multiome using a publicly available CITE-seq
   based reference database of PBMCs within the Azimuth platform^[156]19.

   The scATAC-seq component of these multiome datasets was analysed
   independently using ArchR^[157]18, and then clusters were annotated via
   Azimuth^[158]19. Annotated monocyte, B cells, CD4 T cells, CD8 naive,
   CD8 T other, NK and pDC clusters were isolated to form distinct
   homogeneous clusters to avoid ambiguity in the identity of cells within
   each cluster. Since nucleated erythroid cells have very distinct
   epigenetic landscapes from these immune cell types, we hypothesised
   that they would form a unique cluster from the pre-existing cells.

   Figure [159]5 clearly shows that the added erythroblast cells indeed
   formed a distinct cluster and were highly robust in terms of cell
   number. With 300, 150, 80 and 40 cells, the erythroblast cells were
   clustered together and distinct from all other cell types. However,
   with 20 erythroblast cells, the clustering algorithm failed to cluster
   them separately, with the algorithm instead assigning them to multiple
   clusters (Monocytes, NK cells, CD4 T cells and CD8 naive cells
   clusters).

   Since erythroblast has very distinct cell populations compared to cell
   types in the immune niche, we also performed the same in silico
   experiment using the same procedure and parameters on B cells
   (Supplementary Fig. 9) and CD8 naïve cells (Supplementary Fig. 10) to
   test if the clustering was similarly robust when using cells which have
   more similar epigenetic landscape to PBMC cells used as a background.
   We therefore spiked in 198, 150, 80, 40 or 20 B and CD8 naïve cells
   into the PBMC data. Similar to the erythroid cell cluster used in the
   previous analysis these in silico dilution experiments showed that
   distinct clusters can be identified robustly with even 40 cells showing
   that these results are generalisable to different cell populations with
   epigenetic landscapes less distinct from the cellular background than
   erythroid cells. This is important as 40 cells is much lower than the
   number of cells we determined necessary to annotate a cell cluster’s
   epigenome via pseudo-bulking; therefore, it would suggest that if a
   cell cluster has sufficient cell numbers to form an informative ATAC
   track, then we would generally also have sufficient cell numbers to be
   identified as a coherent cluster.

Discussion

   Since bulk ATAC-seq only provides an average measurement of chromatin
   accessibility profiles from a population of cells, the emergence of
   single-cell ATAC-seq technology has enabled us to accurately map
   specific groups of cells within relatively small populations^[160]29.
   Importantly, this study shows that results obtained by pseudo-bulking
   single cell data have the same signal architecture as bulk
   ATAC-seq. Additionally, scATAC-seq provides better sensitivity for the
   detection of weak open chromatin sites such as those associated with
   CTCF-bound regions and primed enhancers. In combination with the new
   and effective approaches to cluster cells with the same chromatin
   accessibility profiles, scATAC-seq produces highly detailed and
   cell-type specific regulatory maps from heterogeneous mixtures of
   cells.

   Our comparison of the fragment distribution showed that the common
   tagmentation step in both technologies produces evident nucleosomal and
   sub-nucleosomal profiles both within and outside of open chromatin
   peaks, although the degree of nucleosomal and sub-nucleosomal
   contributions differ. However, this still results in very similar
   relative peak strengths within each assay such that the order of
   largest to smallest peaks remains the same in both
   assays (Supplementary Fig. [161]1A). This is obviously critical for the
   use of aggregated scATAC-seq data in place of bulk ATAC-seq. However,
   our analysis shows that while the degree of general genomic background
   is relatively similar in both bulk and scATAC technologies, there is
   increased signal-to-noise over active regions, in particular CTCF
   elements and enhancers (Fig. [162]1E and F). The underlying reasons for
   this are not fully understood. It may be due to the difference in the
   tagmentation profiles within active regions. This shows an elevated
   contribution from sub-nucleosomal fragments in scATAC-seq which may
   represent protection by transcription factors bound to these regions.
   We also see an elevated contribution from mono-nucleosomal fragments,
   which are potentially derived from flanking modified nucleosomes
   (Fig. [163]1D).

   Another important finding of this study is that the scATAC-seq assay
   does not require large numbers of cells to produce informative and
   useful regulatory data. This study shows that a scATAC-seq cluster with
   ~ 200 cells provide data quality equivalent to bulk ATAC-seq data.
   While data sensitivity decreases with cell number, surprisingly 100
   cells from scATAC-seq data still produce interpretable data.
   Importantly, current clustering approaches can effectively identify
   clusters with fewer cells than this (~ 40 cells), so this step does not
   practically limit its use in rare subpopulations in heterogeneous
   mixtures.

Conclusions

   The work presented here shows that scATAC-seq, rather than being
   sparse, is rich in epigenetic information at a per cell level. This
   emphasises the potential of these technologies for both the
   simultaneous discovery of rare but biologically important cell types,
   such as stem and multipotent progenitors, and the use of their
   regulatory landscapes to better understand their biology and non-coding
   genetics Overall, this study comprehensively compares bulk ATAC-seq and
   scATAC-seq on the same material for data quality and resolution,
   generating chromatin accessibility profiles, and discernment of cell
   populations. It provides more detailed insight into the usage of
   scATAC-seq data.

Electronic supplementary material

   Below is the link to the electronic supplementary material.
   [164]Supplementary Material 1^ (8.4MB, docx)

Acknowledgements