Abstract

   Cataloging the diverse cellular architecture of the primate brain is
   crucial for understanding cognition, behavior, and disease in humans.
   Here, we generated a brain-wide single-cell multimodal molecular atlas
   of the rhesus macaque brain. Together, we profiled 2.58 M
   transcriptomes and 1.59 M epigenomes from single nuclei sampled from 30
   regions across the adult brain. Cell composition differed extensively
   across the brain, revealing cellular signatures of region-specific
   functions. We also identified 1.19 M candidate regulatory elements,
   many previously unidentified, allowing us to explore the landscape of
   cis-regulatory grammar and neurological disease risk in a cell
   type–specific manner. Altogether, this multi-omic atlas provides an
   open resource for investigating the evolution of the human brain and
   identifying novel targets for disease interventions.
     __________________________________________________________________

   Single-cell transcriptomic and epigenomic atlas resolves cellular
   complexity and disease risk across the adult macaque brain.

INTRODUCTION

   The cellular and molecular origins of complex human thought and
   behavior remain largely a mystery. Historically, proposed explanations
   have centered on the substantial relative size ([54]1–[55]3),
   prodigious number of cells ([56]4), or the large cortical surface area
   and thickness ([57]5) of the human brain. These explanations in
   isolation, however, fail to explain the many uniquely human faculties,
   nor do they explain the extreme variety and complexity of impairments
   that accompany human neurodevelopmental, neuropsychiatric, and
   neurodegenerative disorders ([58]6). The human brain is composed of
   myriad cell types, and this cellular heterogeneity presumably
   contributes to our cognitive and behavioral complexity ([59]7, [60]8).
   Supporting this hypothesis is the observation that the number of
   distinct cell types in the brain is positively correlated with
   behavioral complexity across vertebrates ([61]9). In recent decades, it
   has been proposed that certain aspects of human cognition are supported
   by specific cell types such as von Economo neurons ([62]10) and “mirror
   neurons” ([63]11), which have been hypothesized to support social
   intuition and empathy, respectively. These propositions, however,
   remain largely untested due to gaps in our understanding of the
   cellular landscape of the human brain and, crucially, differences in
   cell type composition and regional heterogeneity among the brains of
   humans, nonhuman primates, and other animals.

   In recent years, the application of rapidly developing single-cell
   technologies to the brain has begun to address these gaps. Single-cell
   molecular surveys of targeted regions of the mouse and human brain, for
   example, have revealed specialized species-specific cell types—e.g.,
   rosehip neurons in humans ([64]12)—and regional biases in cell type
   distribution and function [e.g., ([65]13)]. Such atlases are yielding
   unprecedented cross-species insights into the cellular architecture
   supporting the structure and function of the brain ([66]14, [67]15),
   but the general paucity of comparative nonhuman primate brain atlases
   has left a conspicuous gap ([68]16). Moreover, much effort has focused
   on single molecular modalities (e.g., transcriptomics), typically in
   only one or a few regions, leaving a lacuna in our understanding of the
   molecular mechanisms underlying cell function across much of the
   primate brain.

   Here, we generated a 4.2 million cell (combined) transcriptomic and
   epigenomic atlas across the brain of the rhesus macaque (Macaca
   mulatta), the most widely used nonhuman primate model organism for
   studies of human perception, cognition, aging, and neurological disease
   ([69]17). These single-cell profiles derive from 30 distinct brain
   regions that collectively represent major cortical, subcortical, and
   cerebellar areas involved in sensory, cognitive, emotional, and motor
   functions. Many of these regions are also implicated in one or more
   clinically relevant neurological disorders. By integrating measures of
   gene expression and chromatin accessibility, we discover molecular
   signatures that define cell types across the macaque brain,
   characterize their distribution and molecular function across disparate
   anatomical regions, and nominate sets of cis-regulatory regions that
   likely contribute to mature cell fate and function across the brain.

RESULTS

A molecular taxonomy of cell types across the primate brain

   We generated single-nucleus RNA sequencing (snRNA-seq) data from 30
   distinct regions across the cortex, subcortex, cerebellum, and
   brainstem (N = 5 animals; three females) using three-level single-cell
   combinatorial indexing RNA-seq (sci-RNA-seq3) ([70]Fig. 1A and table
   S1) ([71]18, [72]19). With the original sci-RNA-seq3 protocol ([73]20),
   we generated 1,008,204 single-nucleus transcriptomes from 110 age-,
   sex-, and hemisphere-matched samples representing 28 brain regions of
   10 year-old (mid-adult aged) macaques (N = 3 animals; two females).
   Over the course of the study, we implemented improvements in nuclei
   isolation and preservation ([74]21), which increased nuclear
   transcriptome recovery by ~60% [median unique molecular indices (UMIs),
   before = 202, after = 320] and, consequently, the number of nuclei
   passing our UMI threshold. With the improved protocol, we generated an
   additional 1,702,081 single-nucleus transcriptomes from the right
   hemisphere of two animals, the vast majority (N = 1,579,908) of which
   were sampled from 27 brain regions of a single 10-year-old female
   macaque. Altogether, after applying quality control (QC) filters
   (Materials and Methods and figs. S1 and S2), we recovered transcriptome
   profiles for 2,583,967 nuclei (median UMI per cell = 265, median genes
   expressed per cell = 221; table S2).

Fig. 1. Experimental setup and summary of the Macaque Brain Atlas snRNA-seq
dataset.

   [75]Fig. 1.
   [76]Open in a new tab

   (A) Schematic of biopsied brain regions for sci-RNA-seq3 experiment. A
   full list of sampled regions is provided in table S1. arc, arcuate
   sulcus; cgs, cingulate sulcus; cs, central sulcus; ecal, external
   calcarine sulcus; iarc, inferior arcuate sulcus; ic, internal capsule;
   ips, intraparietal sulcus; ls, lateral sulcus; lv, lateral ventricle;
   p, principal sulcus; rf, rhinal fissure; sarc, superior arcuate sulcus.
   (B) UMAP visualization of all snRNA-seq profiled cells colored by cell
   type [with color code shown in (C)]. (C) Barplots showing the log[2]
   -transformed cell counts (left), regional specificity score (middle),
   and regional composition [right, with color code shown in (E)] of each
   cell type. (D) UMAP visualization of all snRNA-seq cells colored by
   cell type [with color code shown in (E)]. (E) Barplots showing the cell
   type composition [left, with color code shown in (C)],
   log[2]-transformed ratio of glutamatergic neurons and GABAergic neurons
   (middle), and log[2]-transformed ratio of neurons and glial cells
   (right) of each region. Regions are organized by the regional subclass
   to which they belong.

   Controlling for batch effects across sequencing runs (Materials and
   Methods and fig. S4), we jointly clustered single-cell profiles across
   all sampled brain regions to identify 17 molecularly distinct cell
   types, which we refer to as “cell classes” ([77]Fig. 1, B and C). On
   the basis of established cell markers (fig. S5 and table S3), we
   annotated these 17 cell classes as either: (i) neuronal cells,
   including cortical glutamatergic neurons (CAMK2A), cortical GABAergic
   neurons (GAD1 and GAD2), basket cells (GRID2 and SORCS3), other
   cerebellar neurons (primarily granule cells; GRM4), medium spiny
   neurons (DACH1, PPP1R1B, and BCL11B), serotonergic neurons (TPH2),
   dopaminergic neurons (TH and DBH); or (ii) non-neuronal cells,
   including microglia (DOCK2), oligodendrocyte precursor cells (OPCs;
   VCAN), astrocytes (ALDH1A1 and GFAP), oligodendrocytes (MOG and MBP),
   vascular cells (CFH), and ependymal cells (FOXJ1). Our broad survey
   also captured four rare cell populations that, to our knowledge, have
   not yet been identified in other studies: three RBFOX3^+ (NeuN^+)
   neuron-like populations (marker genes: APOA2, N = 7055 cells; F5, N =
   880; KIR2DL1/2, N = 84) and one RBFOX3^− microglia-like population
   (marker gene: KIR3DL1/2^+, N = 44 cells, also
   P2RY12^+/PTPRC^+/ENTPD1^+). Given their rarity, we removed these four
   cell populations from downstream analyses. Hierarchical clustering of
   cell classes by the top 50 principal components of gene expression
   largely recapitulated broad ontogenetic relationships, with most
   neuronal classes clustering together (dopaminergic neurons being the
   exception) and the two mesoderm-derived classes (microglia and vascular
   cells) clustering together (fig. S6A). Further, analysis of pathways
   associated with gene expression of each cell class revealed known
   molecular physiological processes characteristic of cell classes
   (tables S4 and S5). For example, we found that gene expression in
   glutamatergic and GABAergic neurons were associated with synapse
   assembly/function, oligodendrocyte gene expression was associated with
   myelination, and microglia gene expression was associated with immune
   processes including inflammation. Dopaminergic and serotonergic neurons
   had high activation levels of expected pathways such as tyrosine
   hydroxylase/catecholamine and serotonin/melatonin biosynthesis,
   respectively.

   By sampling across a broad range of anatomical regions within the same
   individuals, we were able to characterize cellular composition across
   30 distinct brain regions—to our knowledge, the most regionally
   expansive nonhuman primate single-cell brain atlas to date ([78]Fig. 1,
   D and E). The distribution of major cell classes was balanced between
   sexes and hemispheres (fig. S7) but differed extensively across
   regions, reflecting the cellular makeup underlying region-specific
   functions ([79]Fig. 1E). Unsupervised hierarchical clustering of brain
   regions according to cell class composition for the most part conformed
   to broader anatomical categorizations, with regions of the cortex,
   subcortex, brainstem, and cerebellum usually grouping together (fig.
   S6B), which was also the case when clustering regions based on the top
   50 principal components of gene expression (fig. S6B). Two of these
   four broad regional classes were composed primarily of a single cell
   class: In the cortex (N = 16 regions, table S6), glutamatergic neurons
   were the most abundant cell type (mean = 63.7% of all cells per sample)
   and outnumbered GABAergic neurons by almost fourfold ([80]Fig. 1E; mean
   = 17.4%), while the cerebellum (N = 2 regions) was composed almost
   entirely of cerebellar neurons (mean = 85.1%). In contrast, the
   subcortex (N = 8 regions) and brainstem (N = 4 regions) were more
   heterogeneous with respect to their cellular composition, with samples
   from these regions containing roughly equal proportions of
   glutamatergic neurons (mean[subcortex] = 25.1%; mean[brainstem] =
   25.5%), GABAergic neurons (mean[subcortex] = 20.2%; mean[brainstem] =
   23.0%), and oligodendrocytes (mean[subcortex] = 18.5%; mean[brainstem]
   = 25.5%). We further subdivided the cortical and subcortical samples
   into “region subclasses” based on neuroanatomical groups (table S1),
   within which variation in cellular composition was more limited
   ([81]Fig. 1E). For instance, in the subcortex, medium spiny neurons
   (MSNs) comprised around half of the cells in the basal ganglia [nucleus
   accumbens (NAc) mean = 44.7%; caudate nucleus (CN) mean = 60.0% MSNs],
   while the thalamus was enriched for GABAergic neurons [lateral
   geniculate nucleus (LGN) mean = 55.7%; mediodorsal thalamic nucleus
   (mdTN) mean = 43.8%; ventrolateral thalamic nucleus (vlTN) mean =
   28.6%].

   Our broad survey also captured two rarer, but important, cell classes:
   dopaminergic and serotonergic neurons. These two classes of neurons
   collectively represented less than 0.3% of all profiled cells
   (dopaminergic = 0.14%; serotonergic = 0.12% of all cells) and 0.5% of
   all neurons (dopaminergic = 0.19%; serotonergic = 0.17% of all
   neurons), suggesting that targeted approaches that enrich for these
   cells [e.g., ([82]22, [83]23)] are necessary to identify
   transcriptional variation among subtypes. Dopaminergic neurons, which
   are found primarily in the substantia nigra pars compacta at low
   frequency [1.1% of cells sampled in the midbrain (MB) versus mean 0.1%
   in other sampled regions], are involved in a range of important
   processes, including voluntary movement, reinforcement learning, and
   addiction, and their loss is a neuropathological hallmark of
   Parkinson’s disease ([84]24). We found that serotonergic neurons were
   most abundant in the brainstem (mean 0.35% in the four brainstem
   regions versus mean 0.09% in other sampled regions), where they play a
   major role in sleep, mood, and appetite, and are key targets of
   pharmacological therapies for major depressive disorder in humans
   ([85]25).

Regional variation in cell subtype composition

   To characterize heterogeneity within cell classes, we partitioned the
   dataset and repeated preprocessing and clustering separately for each
   of the 17 cell classes. Collectively, we identified 112 distinct
   clusters (fig. S8 and table S7), using a coarser clustering criterion,
   and 397 distinct clusters (table S8) using a more fine-grained
   criterion, which captured neuronal and non-neuronal diversity across
   the primate brain ([86]Fig. 2A). We refer to clusters at the coarser
   level as “cell subtypes” and clusters at the finer-grained level as
   “cell subclusters.” We identified extensive heterogeneity in
   glutamatergic (39 subtypes) and GABAergic (20 subtypes) neurons
   primarily found in the cortex and some regions of the subcortex [e.g.,
   hippocampus (HIP) and thalamus], while neurons derived from other
   noncortical brain regions (e.g., cerebellum and striatum) were
   transcriptionally distinct and relatively homogeneous within those
   regions ([87]Fig. 2A). This is due in part to the large number of
   specialized neurons present in some of these regions, including granule
   and Purkinje cells in the cerebellum, and medium spiny neurons in the
   basal ganglia (table S7).

Fig. 2. Cell subtype distribution and variation across the brain.

   [88]Fig. 2.
   [89]Open in a new tab

   (A) Barplots showing the region specificity score (i.e., Jensen-Shannon
   divergence statistic) and composition for cell subtypes (with color
   code shown in [90]Fig. 1E). (B) Heatmap showing scaled log[2] ratios of
   GABAergic neuron and astrocyte subtype compositions within cortical
   region, compared to the average across all regions. Cell subtypes with
   at least 100 cells profiled are shown in the order of abundance (x
   axis, left to right) in the cortical regions organized by region
   subclasses (y axis). The color and direction of each pie correspond to
   relative enrichment (blue, clockwise) and depletion (red,
   anticlockwise) of a cell subtype in a region. Log[2] ratios were capped
   at positive and negative 2 before scaling. (C) UMAP visualizations of
   GABAergic neurons colored by cell subtype (left) and regional subclass
   (right). (D) UMAP visualizations of GABAergic neurons colored by cell
   subtype marker gene expression. (E) UMAP visualization of astrocytes
   colored by the region with the highest lochNESS, indicating enrichment
   of a region subclass in the cell’s transcriptional vicinity. LochNESS
   distribution in a few example regions (occipital lobe, basal ganglia,
   brainstem, thalamus, and frontal lobe) is highlighted in separate
   panels as examples. (F) UMAP visualizations of astrocytes colored by
   lochNESS-derived region-related marker genes.

   Our systematic approach also allowed us to characterize and compare the
   regional cellular distribution of non-neuronal subtypes, including
   those of glia, which have not often been the focus on most single-cell
   atlases to date ([91]Fig. 2A and table S9). Overall, we identified six
   astrocyte, two microglial, seven oligodendrocyte, and six vascular cell
   subtypes, the latter including endothelial cells, smooth muscle cells,
   pericytes, and both perivascular and meningeal fibroblasts (fig. S8 and
   table S7) ([92]26). We compared cell subtypes to published datasets
   using a non-negative least squares (NNLS) approach ([93]19) and found
   broad correspondence with subtypes observed in human cortical ([94]14),
   human brain vascular ([95]26), and macaque hippocampal atlases (fig.
   S10, A to E) ([96]27).

   To identify cell subtypes or subclusters that were specific or biased
   toward a single region or set of regions, we calculated a measure of
   “regional specificity” using the Jensen-Shannon divergence statistic
   (Materials and Methods) ([97]28, [98]29). Overall, glial subtypes were
   more evenly distributed across all regions compared to neuronal
   subtypes ([99]Fig. 2A). This is reflected in lower Jensen-Shannon
   specificity scores for glial subtypes (mean = 0.20; median = 0.15;
   range = [0.04,0.81]) compared to cortical neurons (mean = 0.31; median
   = 0.18; range = [0.08,0.89]). A number of cell subtypes, both neuronal
   and non-neuronal, were highly region specific. For instance,
   oligodendrocyte subtype 8, the rarest oligodendrocyte subtype (N = 3439
   cells; 1.5% of oligodendrocytes) overwhelmingly derived from the highly
   myelinated corpus callosum (CC; 93.0% of these cells; [100]Fig. 2A and
   table S9). Among cortical neurons, GABAergic interneuron subtypes
   generally exhibited a lower median regional specificities than to
   glutamatergic neuron subtypes, although there were a number of
   interneuron subtypes specific to the thalamus (cluster 6) or brainstem
   (clusters 3 and 16), discussed below.

   Given that the regional specificity of excitatory neuronal subtypes has
   been explored in depth in other studies [e.g., ([101]30, [102]31)], we
   focus here instead on populations that are vital for neuronal signal
   transduction but for which cellular diversity has not previously been
   explored across the macaque brain. Specifically, we concentrated on the
   regional diversity of interneurons, because they are important
   components of long-range circuitry and have been characterized in a few
   regions across mice, monkeys, and humans [e.g., ([103]12, [104]14,
   [105]15, [106]32)], allowing us to both benchmark our atlas and extend
   current knowledge to understudied regions. We also examine regional
   distribution among astrocytes, which are crucial for maintaining
   neuronal homeostasis ([107]33) and are implicated in neurological
   disorders ([108]34) but have been relatively understudied at the
   single-cell level.

   We pursued three main approaches to dissect the regional heterogeneity
   within interneuron and astrocyte subtypes, discussed in further detail
   below: (i) quantification of cell subtype composition to identify
   nuanced differences in detailed regions within the cortex; (ii)
   identification of regionally specific gene expression programs by
   analyzing region specific subtypes of interneurons; and (iii) in the
   case of minimal region specific subtypes, leveraging a recently
   developed statistic to identify region-specific gene expression
   patterns in astrocyte subtypes in a cell subtype–agnostic fashion.

   Within specific regions of the cortex, cell subtype composition
   differences become more subtle and require focused quantification. As a
   first approach, for every sufficiently abundant interneuron and
   astrocyte subtype in the cortex (>100 cells), we calculated the
   log[2]-transformed ratio of cell subtype composition in a region,
   compared to the average composition of that subtype across all cortical
   regions ([109]Fig. 2B). Within the five most abundant interneuron
   subtypes, we not only note general balance across all cortical regions
   but also observe a relative enrichment of cluster 2 (PVALB^+) in the
   occipital lobe [primary visual cortex (V1)] with depletion in regions
   within the temporal lobe and depletion of cluster 5 (ADARB/PAX6^+) in
   V1. In the superior temporal sulcus (STS) and middle temporal visual
   area (MT), there is a strong depletion of astrocyte subtype 3
   (LUZP2/GPC5^+) but an enrichment of subtype 6 (KCNIP4/RBFOX1^+).

   Interneurons are the primary drivers of inhibitory control through the
   release of GABA (γ-aminobutyric acid) and thus strongly affect neural
   circuitry. Inappropriate development of GABAergic interneurons and
   subsequent loss of inhibitory regulation contribute to disorders of
   neurodevelopment, including epilepsy and autism ([110]35, [111]36).
   Despite their importance, the molecular identities and distribution of
   interneuron subtypes across the adult primate brain remain relatively
   unknown outside of a few regions ([112]15, [113]32, [114]37). Our
   snRNA-seq sample captured 371,548 GABAergic interneurons corresponding
   to 20 subtypes. As a second approach, we focused on gene markers of
   region-specific interneuron subtypes. Eleven interneuronal subtypes
   were primarily found in the cortex and could be assigned to four
   primary interneuron groups that are conserved between mouse and human
   brains ([115]32), marked by SST, PVALB, VIP, and LAMP5 expression (fig.
   S11). Compared to the cortex, the brainstem and thalamus had a unique
   distribution of interneuron subtypes ([116]Fig. 2C). Thalamic
   interneurons, which use feed-forward inhibition to relay and tune
   visual inputs to thalamocortical neurons, expressed high levels of
   NTNG1 and RNF220 ([117]Fig. 2D), which is indicative of long-range
   interneurons in the first-order relay nuclei of the thalamus ([118]38).
   Sampling across the striatum, which is a critical part of the reward
   pathway and the largest part of the basal ganglia, a recent single-cell
   study identified a molecularly unique primate interneuron ([119]32),
   which was most similar to our GABAergic cluster 18 and represented 15%
   of interneurons in the CN ([120]Fig. 2A).

   Astrocytes, the second most abundant non-neuronal cell type in our
   dataset, are multifaceted support cells of the brain that perform a
   variety of tasks related to neuronal homeostasis. These tasks can vary
   across brain regions ([121]33), and astrocyte dysfunction has been
   linked to neurological diseases, including Alzheimer’s disease
   ([122]39). Given these regional differences, we examined whether
   astrocyte subtypes exhibited regional biases in macaque, similar to
   what has been observed in the mouse brain ([123]40). However, while
   astrocyte subtypes were widely distributed across multiple regions, the
   cell clusters did not correspond neatly to regions of origin, making
   claims about inter-region differences in cell composition difficult to
   systematically analyze across the many regions profiled. To address
   this complexity, as a third approach, we adapted our recently developed
   statistic, lochNESS ([124]41), to quantitatively measure regional
   enrichment in each cell’s “neighborhood” of transcriptionally similar
   cells. Briefly, for each cell, we tally the number of cells from each
   brain region in its neighborhood and calculate a focal regional
   enrichment score (fig. S12A and Materials and Methods). We illustrate
   the utility of this approach by calculating the lochNESS score on
   astrocytes at the level of brain region subclasses. Each cell had 11
   lochNESS scores calculated, one for each region subclass, with each
   such score quantifying the enrichment of the given region subclass in a
   cell’s transcriptional vicinity. We then identified the most enriched
   region subclass in a cell’s neighborhood and examined the regional
   heterogeneity agnostic to the cluster-assigned subtype labels
   ([125]Fig. 2E). We also extended lochNESS to identify genes whose
   expression can be predicted by lochNESS scores for given regions. We
   modeled gene expression as a function of the lochNESS scores of each
   region in each cell using generalized linear regression (Materials and
   Methods and table S10). The resulting set of genes with significant
   positive associations with a region’s lochNESS score have higher
   expression in, and are putatively markers for, cell subtypes in that
   region.

   Using this approach, we identified markers for astrocytes in specific
   regions (e.g., TCAF2 and FRK in the occipital lobe) and in combinations
   of regions (e.g., PGD in the brainstem, basal ganglia, and thalamus),
   which we would not have identified if we focused solely on discrete,
   computationally defined clusters ([126]Fig. 2F). This strategy thus
   facilitates the identification of more complex region-specific gene
   expression patterns. For example, EMID1, which is a marker for a
   subpopulation of astrocyte-like NG2 cells ([127]42), is more highly
   expressed in astrocytes in the cortex but not in the thalamus,
   brainstem, or cerebellum. In contrast, ADAP2, which is involved in
   protection from RNA virus infections ([128]43), is highly specific to a
   subset of astrocytes found in the thalamus ([129]Fig. 2F). LochNESS can
   thus provide a more nuanced approach to identifying regionally biased
   cell subtypes and gene expression than conventional clustering. While
   we focused on astrocytes in this example, lochNESS could be iteratively
   applied to regions within a subclass in each cell class, e.g., for all
   glutamatergic neurons across all cortical regions or oligodendrocytes
   across all subcortical regions (fig. S12, B to D).

Joint analysis of single-nucleus transcriptomic and epigenomic data

   To complement our transcriptomic dataset and identify key regulatory
   genomic regions in brain cells, we applied three-level single-cell
   combinatorial indexing ATAC-seq (sci-ATAC-seq3) ([130]44, [131]45) to
   profile single-nucleus assay for transposase-accessible chromatin
   (ATAC) with sequencing (snATAC-seq) epigenomes from nearly all of the
   brain regions represented in our snRNA-seq dataset. To maximize
   comparability among datasets, we used 110 of the same age-, sex-, and
   hemisphere-matched tissue samples (representing the same three animals)
   profiled in our snRNA-seq dataset. To ensure that the snRNA-seq and
   snATAC-seq datasets captured the same heterogeneous populations of
   cells, we homogenized tissue samples on dry ice before separately
   preparing separate nuclei isolations for each library type (Materials
   and Methods). Together, the snATAC-seq samples represented 28 of the 30
   regions (n = 3 animals; MB and MT snATAC-seq data were not generated).
   After QC (Materials and Methods and fig. S13), the total number of
   nuclei profiled was 1,587,880 and ranged from 5100 [in the closed
   medulla (MdC)] to 114,410 [in the inferior temporal gyrus (IT)] nuclei
   per region (median = 63,739 nuclei per region). We called peaks on a
   per-sample basis and combined them across all samples based on genomic
   overlap, resulting in (after filtering) a combined set of 1,192,873
   candidate cis-regulatory elements (cCREs) spanning 24.4% (725 Mb) of
   the genome.

   We first applied uniform manifold approximation and projection (UMAP)
   dimensionality reduction and Leiden clustering to the batch-corrected
   epigenomic data ([132]Fig. 3A) and identified 42 clusters which, based
   on promoter accessibility, could be assigned to most major cell classes
   found across the brain ([133]Fig. 3B). However, given that unsupervised
   approaches to cell type identification are consistently more sensitive
   using single-cell RNA-seq or snRNA-seq data ([134]46), we drew from our
   transcriptionally defined cell annotations to assign cell labels to our
   snATAC-seq nuclei. To integrate the datasets, we used the graph-linked
   unified embedding (GLUE) approach ([135]47) and generated a unified
   transcriptomic and epigenomic embedding of 4,171,847 nuclei
   ([136]Fig. 3, C and D). Subsequent cell type predictions based on our
   multimodal integration assigned the majority of snATAC-seq nuclei to a
   cell class (73.7% with confidence ≥ 0.95; fig. S14) and captured all of
   the major cell classes ([137]Fig. 3D) with the exception of
   serotonergic and dopaminergic neurons, which are relatively rare and
   fairly specific to the MB (which as noted above was not sampled in our
   snATAC-seq data). The regional distribution of cell classes captured
   from snATAC-seq and snRNA-seq data were highly concordant, both within
   regions ([138]Fig. 3E) and overall ([139]Fig. 3F), which demonstrates
   that our homogenization and nuclei isolation protocols captured the
   same heterogeneous populations of cells in the same regions across both
   modalities.

Fig. 3. Generation of the Macaque Brain Atlas sci-ATAC-seq dataset and
identification of cell classes.

   [140]Fig. 3.
   [141]Open in a new tab

   (A) UMAP visualization of all snATAC-seq cells colored by brain region
   [with color code shown in (C)]. (B) UMAP visualizations of promoter
   accessibility scores of cell markers (GAD2: GABAergic neurons, ENTPD1:
   microglia, SLC1A2: astrocytes, ATP10A: vascular cells) reveal high
   specificity. (C) Barplots showing nuclei counts by brain region of the
   snRNA-seq, snATAC-seq, and integrated datasets. (D) UMAP visualizations
   of integrated multimodal data, with cell classes colored separately for
   (left) snRNA-seq and (right) snATAC-seq nuclei [with color code shown
   in (F)]. (E) Spearman’s rank correlation coefficients showing the
   correlation between cell class proportions in the snRNA-seq and
   snATAC-seq datasets within each region (representing data generated
   from the same homogenized sample). (F) Scatterplot showing the
   correlation between cell class proportions in the overall snRNA-seq and
   snATAC-seq datasets (combined across brain regions). (G)
   Integration-derived cell class annotations visualized over the same
   snATAC-seq UMAP visualization shown in (A) [with color code shown in
   (F)].

The gene regulatory landscape of the rhesus macaque brain

   We leveraged the snRNA-based cell class annotations ([142]Fig. 3G) to
   explore heterogeneity in cell type–specific gene regulation across the
   brain. To do so, we partitioned all unique snATAC-seq reads by
   predicted cell class ([143]Fig. 3G) and then called peaks separately
   for each partition using a similar peak calling approach to that used
   for the overall dataset, thereby generating an inventory of putative
   cCREs derived from each cell class in isolation (Materials and
   Methods). Across 11 cell classes with snATAC-seq–assigned nuclei, we
   identified an average of 210,572 peaks per cell class, ranging from
   99,323 in microglia to 425,738 in cortical GABAergic neurons
   ([144]Fig. 4A). On average, for any given cell class, these peaks
   covered 7.7% of the genome, and 28.8% were found >2 kb from the nearest
   gene or promoter ([145]Fig. 4A).

Fig. 4. Enrichment of TF binding site motifs in candidate regulatory
elements.

   [146]Fig. 4.
   [147]Open in a new tab

   (A) Barplots showing summary statistics for peak sets called separately
   on reads derived from cells assigned to each of 11 cell classes. (B)
   Heatmap showing enrichment (log[2] OR) of TF binding motifs among cell
   classes. The top five most-enriched nonredundant TF motifs (all P[adj]
   < 0.05) are shown per cell class, ordered from left to right by
   increasing P[adj]. Log[2] OR color ranges are capped at ±1.5. (C)
   Position weight matrices of the most-enriched TF motifs for six example
   cell classes. ORs are shown in parentheses. (D) Scatterplots showing
   correlation between snATAC-seq accessibility of TF binding motifs and
   snRNA-seq gene expression of corresponding TF genes within cell classes
   in regional classes for four example TFs.

Transcription factor regulatory networks

   Multimodal integration of cell-specific snATAC-seq and snRNA-seq data
   allowed us to examine the cis- and trans-regulatory links between
   chromatin accessibility and gene expression within individual cell
   types. We first examined putative trans-regulatory factors within cell
   classes and subtypes. Transcription factors (TFs) are key
   trans-regulatory proteins that control cell differentiation and
   function during neurodevelopment ([148]48–[149]51) and have been
   implicated in myriad neurodegenerative diseases ([150]52–[151]54). The
   extremely high cell type specificity of some nuclear TFs have also made
   them useful targets for identifying and enriching rarer cell types
   before single-cell sequencing ([152]55–[153]57).

   To identify candidate trans-acting regulatory networks in each cell
   class, we carried out TF binding motif enrichment analysis on each set
   of cell class–specific peaks, defined as the subset of a cell class’s
   cCREs that did not overlap with any peaks called in other cell classes
   (Materials and Methods and [154]Fig. 4A). Cell-class–specific cCREs
   were highly enriched for many TF binding motifs that are likely
   involved in cell-specific gene regulation ([155]Fig. 4B and table S11),
   including many motifs previously implicated ([156]Fig. 4C). For
   instance, microglial cCREs contained 6.6-fold more binding sites of the
   nuclear TF SPI1 (also known as PU.1) than expected by chance (P[adj] =
   1.22 × 10^−284; [157]Fig. 4, B and C). In addition to such canonical
   examples, we identified numerous motifs that distinguish relatively
   similar cell classes. For instance, the TF binding motif for NFE2, from
   the nuclear respiratory factor (NRF) TF family, was most enriched [odds
   ratio (OR) > 2] in cCREs in both medium spiny neurons (OR = 3.07,
   P[adj]= 1.77 × 10^−87) and basket cells (OR = 2.34, P[adj] = 8.76 ×
   10^−7), while the binding motif for NEUROD1 was most enriched (OR > 2)
   in cCREs of basket cells (OR = 2.04, P[adj] = 3.53 × 10^−29), where
   this TF is necessary for basket cell terminal differentiation and,
   consequently, axon growth and inhibitory circuit formation ([158]58).

   We also characterized TF binding motif enrichment at the cell subtype
   level. To do this, we extended our multimodal integration and
   label-transferring approach to each cell class independently by
   tabulating the reads per-cell falling within cell class–specific cCREs
   described above for all cells of a given cell class ([159]Fig. 3D). We
   then integrated the data with corresponding snRNA-seq data of the same
   cell class using GLUE (Materials and Methods). The resulting integrated
   embeddings for each cell class were then used as the basis for
   predicting cell subtypes, which we carried out on all snATAC-seq cells
   within each class (fig. S15).

   Since cell subtypes are preselected to already share broadly similar
   chromatin accessibility profiles, identifying peaks that are specific
   to a single subtype—similar to our approach at the cell class level—was
   not feasible and left most cell subtypes with no or very few unique
   peaks to analyze. As an alternative strategy, we carried out
   differential accessibility analyses among cell subtypes to identify
   peaks that were predictive of each individual cell subtype within a
   given cell class (Materials and Methods). We then identified TF binding
   motifs enriched in highly differentially accessible regions within cell
   subtypes (table S12). For example, we observed numerous TF binding
   motifs (N = 433, P[adj] < 0.05) that were enriched within highly
   accessible peaks in Purkinje cells, a GABAergic neuron type of the
   cerebellum that is implicated in autism spectrum disorders (ASDs). In
   our snRNA-seq dataset, of all tested diseases ([160]59), genes
   associated with autism (DOID:12849) were overrepresented (Fisher’s
   exact test, OR = 10.2, P[adj] = 8.52 × 10^−16) among the top 100
   Purkinje cell marker genes, including RORA [fold change (FC) = 331.9],
   AUTS2 (FC = 43.1), and SHANK2 (FC = 13.8) (table S13). Correspondingly,
   we found that TF motifs enriched in differentially accessible peaks
   included RORA, four members of the early growth response (EGR) family
   (EGR1 to EGR4), and CTCF (EGR1, EGR3, and CTCF were among the top five
   TF motifs ranked by OR; RORA ranked 182nd). RORA is a regulator of
   circadian rhythm that exhibits decreased expression in ASD brains and
   may play a role in ASD pathogenesis ([161]60, [162]61). EGR family TFs
   have been implicated in the disruption of human-specific developmental
   programs in autism ([163]62). CTCF is an insulator protein that
   regulates chromatin structure and may play a critical role in
   maintaining dendrite structure in Purkinje cells ([164]63) and CTCF is
   also a risk gene for ASD ([165]64).

   Given that families of TFs have similar binding motifs, it is often
   difficult to identify the specific TF in a given family that is
   responsible for enrichment in cell type–specific cCREs. To identify the
   most likely TF, we therefore used our recently developed approach
   ([166]45, [167]65) that uses the computationally paired snRNA-seq and
   snATAC-seq data. Briefly, this approach relies on the assumption that
   TFs will be highly expressed in cell types where they play a key role,
   while their associated motif should be enriched (or depleted) in that
   cell’s cCREs, indicating TF activation (or repression). Overall, we
   compared the accessibility of 369 TF binding motifs and their
   corresponding gene’s expression across the cell classes in four region
   subclasses, with 189 TFs showing positive Pearson’s correlation between
   gene expression and accessibility of the cognate motif and 180 showing
   negative correlation (fig. S16A and table S14). Among the TFs with
   largest positive or negative Pearson’s correlation values were strong
   cell class–specific activators and repressors ([168]Fig. 4D and fig.
   S16B). For instance, SPI1, which has been identified as a candidate
   gene for Alzheimer’s disease via various functional genetics approaches
   ([169]66), shows a strong activating effect with high expression of the
   SPI1 gene and high accessibility for the SPI1 binding motif in
   microglia. In contrast, NFATC2 has a repressing effect in microglia and
   vascular cells, as shown by high expression of the NFATC2 gene
   associated with lower NFATC2 motif binding in those cell types. We also
   found evidence for a clear distinction between neurons and non-neuronal
   cells at two TFs, with ELF1 functioning as a non-neuronal specific
   activator and NEUROD2 as a neuron-specific activator. In addition, we
   note that FLI1, an activator in vascular and microglia cell types, and
   ELF1 have motif sequences similar to SPI1 (fig. S16C), but their
   activating effects affect a broader set of cell types.

The cis-regulatory landscape of brain cell variation

   We next sought to characterize cis-regulatory interactions between
   cCREs and proximate genes in the rhesus macaque brain. We used two
   complementary analyses to scan for interactions using our integrated
   multimodal dataset. First, we used the regulatory inference framework
   of GLUE ([170]47), which leverages the unified feature embedding (i.e.,
   joint integration of snRNA-seq genes and snATAC-seq peaks in a common
   data space) generated during GLUE integration to assess similarity
   between peaks and genes. Putative regulatory interactions are defined
   as a high cosine similarity between peak and gene feature embeddings in
   the unified data space, with statistical significance assessed by
   permutation ([171]47). Second, we used a metacell-based approach to
   aggregate snRNA-seq transcriptomes and snATAC-seq epigenomes into
   multimodal metacells based on k-means clustering of the unified cell
   embeddings and then used logistic regression to model the relationship
   between gene expression and chromatin accessibility within a given
   metacell ([172]67). In contrast to the GLUE regulatory score, the
   logistic regression analysis enabled us to differentiate between
   positive and negative regulatory interactions between peaks and genes.
   We considered peak-gene pairs to be putatively regulatory if P[adj] <
   0.05 for both analyses ([173]Fig. 5A and fig. S17). For each cell
   class, we also scanned for differentially accessible peaks using both a
   regularized logistic regression and a t test, testing accessibility in
   a given cell class against accessibility in all other cell classes. We
   consider cCREs with differentially high accessibility (regularized LR
   coefficient > 0, log[2] FC > 0, and t test P[adj] < 0.05; fig. S18) to
   be candidate regulators of cell type–specific genes ([174]Fig. 5A).

Fig. 5. The landscape of cis-regulatory interactions in the Macaque Brain
Atlas.

   [175]Fig. 5.
   [176]Open in a new tab

   (A) Schematic outlining criteria for identification of cCREs. Squares
   represent peak:gene pairs, darker colors symbolize stronger evidence
   for a given measure, and solid borders represent statistically
   significant measures. (B) Distribution of gene-peak GLUE regulatory
   scores binned according to the minimum signed distance (left: upstream,
   right: downstream) between peaks and gene TSSs. Distributions are shown
   separately according to whether the gene:peak pair also exhibited a
   significant association (P[adj] < 0.05) based on the metacell-based
   logistic regression analysis. The ratio between significant and
   not-significant gene-peak pairs for proximity bins according to the
   logistic regression model is shown in the top margin, while the
   distribution of GLUE regulatory scores is shown in the right margin.
   (C) Candidate regulatory elements are shown in relation to, from top to
   bottom, (i) the strength of and direction of inferred regulatory links
   connecting peaks to MBP expression (based on metacell logistic
   regression analysis). The height of links represents the strength
   (−log[10] P values) of evidence for regulatory connections and the
   color symbolizes the direction of the relationship; (ii) the
   differential accessibility (−log[2] FC) of peaks in oligodendrocytes
   relative to all other cell classes; (iii) the distribution of
   normalized snATAC-seq reads by cell class; (iv) gene and transcript
   boundaries of MBP and its known isoforms in the rhesus macaque genome,
   with exons shown in blue; (v) the distribution of normalized snRNA-seq
   reads by cell class. Oligodendrocyte reads are shown in relation to all
   other cell classes on the upper portion of the plot. On the bottom
   portion, the y axis is magnified ×60 and cropped to highlight more
   subtle differences among cell classes.

   We focused our analysis on the 6000 most variable genes in our
   snRNA-seq dataset and tested all snATAC-seq peaks that fell within 150
   kb of the gene promoter (defined as TSS extended 2 kb upstream). In
   total, we tested 223,752 peak-gene pairs (151,083 unique peaks, 5765
   unique genes), of which 142,324 peak-gene pairs (63.6%) met our
   criteria for being considered candidate cis-regulatory interactions
   (table S15). A total of 128,741 peaks (85.2%) that we evaluated were
   cCREs for at least one gene, and 4811 genes (83.5%) that we evaluated
   had at least one cCRE.

   Of all peak-gene pairs, 132,805 (93.3%) involved a peak that was highly
   differentially accessible in at least one cell class, thereby
   fulfilling our criteria for being considered candidate cis molecular
   interactions regulating cell type–specific markers. cCREs were highly
   differentially accessible in a maximum of seven cell classes, with 37%
   exclusive to a single cell class and 88% highly differentially
   accessible in one to three cell classes.

   The vast majority (133,496 or 93.8%) of candidate regulatory
   interactions were positively associated (i.e., had a positive effect
   size in the metacell logistic regression)—this held true whether peaks
   were upstream (13,650 of 14,575 or 93.7%), downstream (116,939 of
   124,592 or 93.9%), or overlapping (2907 of 3157 or 92.1%) the gene’s
   transcription start site (TSS). For peak-gene pairs where the peak was
   upstream of the TSS, the GLUE regulatory scores were highest
   (indicating high similarity between peak and gene feature embeddings)
   when peaks were in closer proximity to the TSS ([177]Fig. 5B). For
   peaks downstream of the TSS, GLUE regulatory scores remained high
   across all distances, with only a modest decrease farther from the TSS
   ([178]Fig. 5B). This result was particularly notable for peaks that had
   significant, mainly positive, associations between accessibility and
   gene expression, likely reflecting (i) higher global accessibility
   across the gene body resulting from higher expression of the gene (as
   opposed to distal regulation) and/or (ii) methodological limitations of
   using a single gene-wide TSS (i.e., the most upstream TSS of all
   isoforms), thereby ignoring variation in TSS positioning among
   isoforms, which likely vary in their usage across tissues and contexts
   ([179]68).

   Using the cell class–specific gene expression and cCRE peak sets, we
   repeated our integration, regulatory inference, and differential
   accessibility workflows on each cell class individually. We tested a
   mean of 72,914 peak-gene pairs (range: 45,539 to 114,200) per cell
   class and identified a mean of 11,442 peak-gene pairs (range: 881 to
   41,966) showing evidence of regulatory interactions (fig. S19 and table
   S16).

   To illustrate how these maps of putative interactions might be useful
   to investigate the regulatory landscape at the level of an individual
   locus, we focused on the myelin basic protein (MBP) gene
   ([180]Fig. 5C), which encodes one of the most abundant proteins in
   central nervous system myelin ([181]69, [182]70), has a range of splice
   isoforms ([183]71), and is a canonical marker of oligodendrocytes. MBP
   is located on chromosome 18 (positions 2,932,531 to 3,086,873) on the
   rhesus macaque (Mmul_10) genome and has eight annotated mRNA isoforms
   (Ensembl). In humans, classic MBP isoform 3 (18.5 kDa) predominates in
   adult myelin ([184]71).

   In our global peak set (all cells), 94 peaks fell within 150 kb of the
   MBP promoter and were included in our analysis. Of these peaks, 83
   (88.3%) were identified as candidate regulators of MBP (crMBP), with 38
   crMBPs (45.8%) positively associated with MBP expression. Of all
   crMBPs, only one was not located within the MBP gene boundaries—it was,
   however, located less than 2 kb upstream within the likely promoter
   region.

   In accordance with the well-known status of MBP as an oligodendrocyte
   marker, we found that MBP was differentially expressed in
   oligodendrocytes, with detected expression in 80.9% of cells and
   1434-fold higher expression than all other cells averaged together.
   Fine-grained inspection of normalized read distributions from
   oligodendrocyte nuclei revealed the highest densities of snRNA-seq
   reads corresponding to the polyadenylation site (position 3,086,373)
   and snATAC-seq reads corresponding to the TSS (position 3,046,976) of a
   single transcript, ENSMMUT00000015870, indicating that it is likely the
   dominant MBP isoform expressed in adult macaque oligodendrocytes.

   By examining the genomic-distance relationships between crMBPs and the
   dominant MBP transcript in adult oligodendrocytes, we found that all 16
   crMBPs that either overlapped or were downstream of the isoform’s TSS
   were positively associated with MBP expression. Among the 67 crMBPs
   that were located upstream of the TSS, 22 (32.8%) were positively
   associated with MBP expression, while 45 (67.2%) were negatively
   associated. Several of these negatively associated crMBPs corresponded
   with sci-ATAC-seq3 peaks in other cell types, particularly OPCs and
   microglia ([185]Fig. 5C). However, the accessibility landscape of OPCs
   is overall more similar to that of oligodendrocytes across the region
   upstream of the TSS of the dominant isoform, with greater accessibility
   at most peaks except for that of the promoter of the dominant isoform
   ([186]Fig. 5C). As OPCs play a critical role in myelinogenesis by
   giving rise to oligodendrocytes ([187]72), these crMBPs likely serve as
   critical markers of the OPC-oligodendrocyte transition, during which
   the expression of this gene, and this isoform in particular, is
   massively up-regulated.

Evolutionary conservatism and divergence of candidate regulatory elements

   Human brain specialization could be driven in part by changes in cell
   type composition and function following the evolutionary divergence of
   cell type–specific regulatory elements. To evaluate evolutionary
   divergence/conservatism, we tested whether cCREs that were
   differentially accessible across cell classes (table S15) or unique to
   cell classes (table S17) were associated with regions that underwent
   rapid evolution in the human lineage [i.e., human accelerated regions
   (HARs) ([188]73); human ancestor quickly evolved regions (HAQERs)
   ([189]74)] or are differentially accessible between humans and
   chimpanzees in cerebral organoids (DA[HC] regions) ([190]75). We also
   tested whether cCREs were enriched for these evolutionarily salient
   regions (relative to all peaks, called on the global set) and did not
   detect any enrichments (table S17). We detected an enrichment of DA[HC]
   regions among differentially accessible cCREs (OR = Inf; P[adj] =
   0.046) and a depletion of DA[HC] regions among glutaminergic
   neuron-specific cCREs (OR = 0.263; P[adj] = 0.046) (table S17). These
   findings indicate that regulatory areas exhibiting differential
   accessibility across cell types in the adult macaque brain may also
   exhibit differential accessibility in the developing human versus
   chimpanzee brain, while cell class–specific regions may be more
   conserved during primate brain development. Neither HARs nor HAQERs
   were enriched among differentially accessible or cell class–specific
   cCREs (P[adj] > 0.05), which was expected given that intercell class
   variation in orthologous gene expression is generally well-conserved in
   primates ([191]76). We did identify six differentially accessible or
   cell class–specific cCREs that overlapped with HARs or HAQERs (table
   S17). There was one cCRE that was differentially accessible in macaque
   medium spiny neurons, cerebellar neurons, and vascular cells that
   overlapped with two HAQERs. These cCREs are also: (i) located in the
   1q21.1-2 region containing the NBPF gene cluster that contains several
   human-specific segmental duplications; (ii) areas of open chromatin in
   the developing human brain; and/or (iii) regions showing differential
   chromatin accessibility between human and chimpanzee cerebral organoids
   ([192]74). While the presence of HAQERs in these regions suggests that
   they may function differently in humans versus macaques, the human
   sequences for these HAQERs do not appear to exhibit significantly
   greater enhancer activity compared to nonhuman primate sequences
   ([193]74).

Enrichment of disease heritability among candidate regulatory elements

   Last, we used our cCREs to identify cell type–associated regulatory
   networks that may drive polygenic disease risk. We tested for
   enrichment of disease trait heritability using the linkage
   disequilibrium score regression (LDSC) tool ([194]77, [195]78), after
   lifting over macaque cCREs to human genome coordinates ([196]28). We
   tested a total of 53 phenotypes relevant to neurological diseases,
   disorders, syndromes, behaviors, or other traits (table S18) and
   examined enrichment among cell class cCREs called separately in each of
   11 cell classes.

   Our results broadly recapitulated several known roles of cell classes
   in neurological disease ([197]Fig. 6 and table S19). For example, sites
   associated with cardioembolic stroke (OR = 32.2) or ischemic stroke (OR
   = 9.2) were enriched (P[adj] < 0.05) only in vascular cells, which play
   a crucial role in forming and maintaining the blood-brain barrier
   ([198]79). We also found that Alzheimer’s disease–associated sites were
   enriched only in microglia—a result replicated using loci from three
   independent genome-wide association studies (GWAS) (OR range:
   13.9–15.0)—consistent with the prominent role of microglia
   proliferation and activation in Alzheimer’s disease ([199]80).

Fig. 6. Enrichment of heritable disease-relevant sites among candidate
regulatory elements.

   [200]Fig. 6.
   [201]Open in a new tab

   The heatmap displays heritability enrichment (log[2] OR) of diseases
   among cell class snATAC-seq peaks for tested diseases, syndromes, and
   phenotypes. Only results passing a threshold of P[adj] < 0.05 are
   shown. The log[2] OR color range is capped at 3.0.

   Across all cell classes, basket cells were enriched for the greatest
   number (N = 37) of GWAS phenotypes, including disorders such as
   schizophrenia (OR range: 5.9 to 6.2), bipolar disorder (OR range: 5.6
   to 6.2), and major depressive disorder (OR range: 5.1 to 5.3) and, most
   strongly, epilepsy (OR = 9.0)—a disease that basket cells have been
   connected to in animal models and some genetically linked human forms
   of the disease ([202]81).

   Other notable results included the enrichment of multiple
   sclerosis-associated sites among open regions in microglia (OR = 46.6),
   highlighting the outsized role of these immune cells in the etiology of
   the disease and as a putative therapeutic target ([203]82, [204]83). In
   multiple sclerosis, disease-associated microglia alter their
   transcriptional profiles and may contribute to neuroinflammatory
   processes underpinning this autoimmune disorder ([205]83). We also
   found enrichment of Parkinson’s disease–associated sites among open
   regions in the glial OPC, oligodendrocyte, and astrocyte cell classes
   (OR range: 7.0 to 8.4). In Parkinson’s disease, glial cells may play a
   major role in the progressive degeneration of dopaminergic neurons
   ([206]84), a classic hallmark of Parkinson’s disease, or in alterations
   to glutamatergic neurotransmission ([207]85).

   Last, we found that heritable sites associated with attention
   deficit/hyperactivity disorder (ADHD) in our analysis were enriched
   only among open regions of medium spiny neurons. While the magnitude of
   the enrichment was relatively mild (OR = 2.6, P[adj] = 0.031), genetic
   variants associated with ADHD have been historically difficult to
   identify, with the first risk loci only recently reported ([208]86).
   Medium spiny neurons have been linked to behavioral hyperactivity and
   disrupted attention via activation of astrocyte-mediated synaptogenesis
   ([209]87). Our results therefore suggest that medium spiny neurons may
   be a promising target for prospective ADHD therapeutics warranting
   further study.

DISCUSSION

   Understanding the cellular architecture of the primate brain is crucial
   both for understanding the evolution of human cognition and behavior
   and for identifying mechanisms underlying neurological disorders. In
   service of these goals, we used snRNA-seq and snATAC-seq to derive a
   molecular atlas spanning the adult rhesus macaque brain, comprising
   data from over 4 million cells profiled from 30 brain regions. On the
   basis of our multimodal molecular data, we identified 112 distinct
   molecular cell types or subtypes and characterized their distribution
   across the macaque brain, adding to the growing number of primate
   single-cell molecular brain atlases ([210]15, [211]32). The data are
   freely available (NeMO archive, nemo:dat-rtmm5q2) and browsable
   (CELLxGENE viewer,
   [212]https://cellxgene.cziscience.com/collections/8c4bcf0d-b4df-45c7-88
   8c-74fb0013e9e7) and will serve as a rich resource for the neuroscience
   and neurogenomics communities.

   In generating a multiregion transcriptomic and epigenomic atlas of the
   most widely used nonhuman primate in neuroscience, we (i) identified
   all of the major brain cell classes and many cell types that have been
   previously reported ([213]Figs. 1 and [214]2); (ii) quantified regional
   distribution of cell types and subtypes within individuals, which
   allowed us to identify compositional differences in samples collected
   at the same time and from the same animals ([215]Fig. 2); (iii)
   identified rare and regionally specific cell types (e.g., Purkinje
   cells), which may facilitate the development of molecular tools such as
   cell type–specific viral vectors that, in combination with new
   technologies such as CellREADR (Cell access through RNA sensing by
   Endogenous ADAR) ([216]88), may enable precise targeting of cell types
   based on their unique patterns of chromatin accessibility and gene
   expression; (iv) characterized multiple trans- and cis-regulatory
   mechanisms that differentiate cell classes and subtypes ([217]Figs. 4,
   A to D, and [218]5); and (v) identified numerous associations between
   genetic risk for neurological disorders and the epigenomic states of
   specific cell types ([219]Fig. 6).

   Notably, this single-cell atlas of the adult primate is generated from
   samples collected from healthy adults. The paired nature of the
   dataset, with regions sampled from the same individual brains, avoids
   many of the inter-individual variables (e.g., genotype and environment)
   that can affect neurological development and function. The atlas may
   thus be a valuable resource for characterizing molecular features that
   play a role in myriad neurological disorders. The relatively few unique
   individuals sampled also represent a limitation of the current study—we
   currently know very little about how brains of healthy individuals
   differ in cell composition and function and what that confers for
   disease susceptibility and/or progression. Given continuing
   improvements in cost and throughput of single-cell sequencing,
   characterizing multiregion cellular variation across many healthy
   individuals is becoming not only a possibility but also an emerging
   priority for the field.

   To our knowledge, these data represent the largest and most
   comprehensive multimodal molecular atlas in any nonhuman primate to
   date and provide a resource for exploring how the heterogeneous
   molecular and cellular composition of the brain gives rise to the
   behavioral complexity of primates including humans. We anticipate that
   these data will also provide a critical and much-needed molecular and
   neurobiological map of complex human-relevant social behavior and
   disease, as well as an extensive substrate for comparative analyses
   across animal brains.

MATERIALS AND METHODS

Study population and sample collection

   All animals sampled in this study are rhesus macaques (M. mulatta) from
   the semi-free-ranging colony on the island of Cayo Santiago, Puerto
   Rico. Maintained by the Caribbean Primate Research Center (CPRC) within
   the University of Puerto Rico, the Cayo Santiago macaque colony has
   been largely continuously studied since its founding in 1938 ([220]89).
   All present-day macaques are descended from an initial founder
   population of 409 animals and have since maintained an outbred
   population structure despite generations of isolation ([221]90). Apart
   from being provisioned with commercial feed and occasionally subject to
   capture-and-release sampling, the macaques otherwise live in
   naturalistic conditions, subject to minimal intervention and
   manipulation, as approved by Institutional Animal Care and Use
   Committee. The study used animals that needed to be removed from Cayo
   Santiago ([222]91) and were immediately euthanized. Standardized tissue
   collection and sample archiving were coordinated by the Cayo Biobank
   Research Unit (CBRU), which provided the brain samples used in this
   study ([223]92, [224]93).

   Procedures for necropsy, brain removal, and dissection followed those
   previously described for this population ([225]93) and are briefly
   outlined here. Following veterinary euthanasia, brains were perfused
   with sterile saline, removed from the cranium, and hemisected into left
   and right hemispheres using a long single-edge razor blade. After
   sectioning off the cerebellum/brainstem from each hemisphere, the
   cerebral hemispheres were placed on custom molds (designed either for
   left or right hemispheres) and coronally sectioned into 11 roughly
   5-mm-thick blocks, numbered in order rostral to caudal. All 12 blocks
   (with the cerebellum/brainstem considered block 12) were then sealed in
   Whirl-pak bags, flash-frozen in liquid nitrogen vapor, and archived in
   ultralow −80°C freezers. The interval between euthanasia and permanent
   storage of frozen tissue averaged 51 min, with an SD of 5.8.

   All procedures were performed in accordance with the National
   Institutes of Health (NIH) Guide for the Care and Use of Laboratory
   Animals and were approved by the Institutional Animal Care and Use
   Committee at the University of Puerto Rico (protocol #338300). Five
   macaques were included in this study (table S2). The vast majority of
   the data were derived from four 10-year-old macaques, which are
   considered middle-aged adults in this population ([226]93, [227]94).

Region selection and biopsy

   Frozen brain blocks were placed on a dissection tray over dry ice to
   keep tissue frozen during biopsy collection. Individual blocks were
   then moved from the dry ice to a tray sitting on wet ice, allowing for
   tissues to be acutely warmed to the point that biopsies could be taken
   from targeted structures. Biopsies were made using a cutting spoon
   (Fine Science Tools Inc., catalog no. 10360-13). Dissected brain
   regions are listed in table S1, and approximate locations for biopsy
   are illustrated in [228]Fig. 1A. For a given structure, attempts were
   made to minimize inclusion of off-target surrounding tissues (e.g.,
   white matter underlying a targeted gray matter structure). Below, we
   document the most common block numbers where structures were located.
   Because of interindividual differences and/or variation in sectioning,
   regions of interest were sometimes identified and dissected from
   adjacent blocks based on neuroanatomical landmarks. Alternate block
   numbers are therefore also documented below.

   The most anterior block sampled for this study (block 2) contained gray
   matter for the dorsomedial (dmPFC), ventromedial (vmPFC), dorsolateral
   (dlPFC), and ventrolateral prefrontal cortices (vlPFC). dmPFC and vmPFC
   were defined as being on the medial side of block 2. The dmPFC biopsy
   was pulled from the gray matter in the ~top half of the medial edge of
   the block. A space along the medial edge was left to separate dmPFC
   from vmPFC. The vmPFC biopsy was pulled from the medial ventral half of
   the tissue block. Biopsies of dlPFC came from the cortical tissues
   surrounding the dorsal lateral portion of the block that included the
   superior and inferior portions of the principal sulcus. Samples from
   vlPFC came from the ventral and lateral portion of the block. As was
   the case on the medial side, a portion of the cortex was left between
   each lateral biopsy to avoid overlap ([229]Fig. 1A).

   Block 3 (sometimes 4) contained biopsies for the anterior cingulate
   cortex (ACC), CC, and head of the CN. The biopsy for ACC was the gray
   matter sitting between the CC, which is ventral to ACC and the
   cingulate sulcus (cs), which sits dorsal to the cingulate gyrus. CC was
   defined as the white matter track sitting ventral to the ACC and medial
   to the lateral ventricle. The CN was the gray matter sitting
   ventrolateral to the lateral ventricle and surrounded on all other
   sides by white matter. The CN was the only biopsy in the second block
   that was scooped out of the block face to minimize inclusion of any
   white matter sitting anteriorly past the CN within the block
   ([230]Fig. 1A).

   Block 5 (sometimes 4 or 6) contained the amygdala (AMY), entorhinal
   cortex (EC), perirhinal cortex (PC), and NAc. The NAc is located
   ventral to the caudate, internal capsule, and putamen (Pu).
   Furthermore, in fresh-frozen tissue, there was a slightly darker color
   to the NAc. The tissue making up the NAc was scooped out of the block
   face. Similarly, the AMY was identified as ventral to the Pu, medial to
   the ventral portion of the claustrum, and dorsal to the EC. The AMY was
   also scooped out to minimize the inadvertent collection of neurons
   within the hippocampus (HIP). Last, the EC and PC were collected, the
   delineation between the two was the rhinal fissure ([231]Fig. 1A).

   Blocks 5 and 6 (sometimes 4 or 7) contained tissue that were biopsied
   to represent cortical regions primary motor cortex (M1), primary
   somatosensory cortex (S1), primary auditory cortex (A1), superior
   temporal cortex (STS), and IT. Subcortical structures that were
   biopsied included mdTN, vlTN, LGN, and HIP. The delineation between M1
   and S1 was the central sulcus and was taken from the approximate
   central third of the lateral portion of each respective gyrus. Within a
   case, attempts were made to biopsy from approximately the same putative
   mototopic and somatotopic regions. A1 biopsies were taken from the
   dorsal portion of the superior temporal gyrus, which is within the
   lateral sulcus (ls) (i.e., inferior operculum). The gray matter forming
   the STS sits ventral to the superior temporal gyrus and dorsal to IT.
   IT was defined as the gray matter forming the lateral portion of the
   inferior temporal cortex. mdTN sits bilaterally on midline, within the
   thalamus. It is bound by ventricles dorsally, laterally by the
   centrolateral thalamic nucleus and ventrally by the centromedial
   thalamic nucleus. vlTN is bound by the centrolateral thalamic nucleus
   medially, body of the CN dorsally, and the reticular thalamic nucleus
   laterally. Biopsies for mdTN were taken from the central and central
   medial portions of the nucleus, while vlTN biopsies were taken from the
   central portion of the nucleus. In both cases, this was in an effort to
   avoid inclusion of other thalamic nuclei. The LGN is a six-layered
   structure that is easily observed on the coronal face of fresh-frozen
   slabs. When observed, the biopsy was scooped out. Like the LGN, the HIP
   was defined by its classic cytoarchitectonic features within the medial
   temporal lobe. For biopsies, efforts were made to not include EC, which
   sits ventral and ventromedial to HIP ([232]Fig. 1A).

   Block 7 (sometimes 6 or 8) contained tissues representing the superior
   posterior parietal (SPP), inferior posterior parietal (IPP), and the
   middle temporal visual area (MT). SPP biopsies were from the gray
   matter of the superior lobule. The intraparietal sulcus sits between
   SPP and IPP. Therefore, IPP biopsies were taken from the gray matter of
   the second, more lateral lobule. Last, area MT was defined by the gray
   matter of the insular cortex, bound on its medial edge by white matter
   of the extreme external capsule and laterally by the superior and
   inferior operculum divided by the STS ([233]Fig. 1A) ([234]95,
   [235]96).

   The final cerebral block, block 11, contained the visual cortex.
   Biopsies from V1 were taken from the dorsolateral surface gray matter
   above the external calcarine sulcus ([236]Fig. 1A).

   The hemisected cerebellum/brainstem block was dissected as follows.
   First, the cerebellum was dissected off and the cerebellar vermis (CV)
   was separated from the lateral cerebellar cortex (lCb). Next, the
   remaining brainstem was dissected such that the MB block was separated
   by making a cut from just behind the inferior colliculus to the top of
   the basilar pons. Next, the pons was separated from the medulla by
   making a cut from the stria medullaris (approximate center of the
   fourth ventricle) to the base of the pons. A final cut at the base of
   the fourth ventricle to separate the open medulla (MdO) from the MdC
   ([237]Fig. 1A).

   To allow for the profiling of multiple genomic modalities from the same
   representative cell populations, we pulverized all biopsies on dry ice
   to homogenize and divide tissue for downstream experiments. We followed
   the tissue pulverization procedures described by Domcke et al.
   ([238]97) to achieve a powder consistency on a sterile aluminum foil
   work surface. Once sufficiently pulverized, we stirred the sample
   thoroughly and then divided the sample using the folded edge of foil as
   a funnel into new 1.5-ml prechilled and prelabeled microcentrifuge
   tubes. Foil and tubes were set on aluminum trays or tube racks set on
   dry ice to keep powdered tissue frozen throughout this process. We
   divided samples into roughly a 2:1 ratio given the expected
   efficiencies/yields for snRNA-seq and snATAC-seq protocols,
   respectively. Pulverized tissue was stored at −80°C up until processing
   for downstream library preparation procedures.

snRNA-seq data generation

   To profile single-nucleus gene expression, we performed snRNA-seq using
   the sci-RNA-seq3 approach ([239]19), which is the improved version of
   the original sci-RNA-seq protocol ([240]18). For two of the three
   experimental batches in our dataset, we used a protocol closely
   adhering to the sci-RNA-seq3 protocol described by Cao et al.
   ([241]19). For the third batch, we used the improved protocol (“tiny
   sci”) described by Martin et al. ([242]21). Sample order was randomized
   between the first two batches, and within the third batch, to minimize
   batch effects and other technical artifacts.

   For the first two batches, we slightly modified the protocol described
   by Cao et al. ([243]19, [244]20) for a different tissue type and
   smaller input amounts. Briefly, we added 50 μl of cell lysis buffer to
   pulverized tissue in a 1.5-ml microcentrifuge tube and then homogenized
   the tissue using 5 to 10 strokes with a disposable ribonuclease
   (RNase)–free plastic pestle (Fisherbrand, catalog no. 12-141-364). We
   then added another 950 μl of cell lysis buffer, mixed by pipette, and
   then transferred the suspension through a 70-μm cell strainer
   (pluriSelect, catalog no. 43-10070-70) into a 15-ml conical tube
   containing 5 ml of ice-cold 4% paraformaldehyde. Nuclei were fixed in
   4% paraformaldehyde for 15 min with occasional mixing, washed once in
   1-ml ice-cold nuclei wash buffer, and then suspended in 200-μl nuclei
   wash buffer. Nuclei were counted by mixing with 1 μM YOYO-1 iodide
   (Thermo Fisher Scientific, catalog no. Y3601) using a Countess II FL
   automated cell counter (Life Technologies), divided into tubes in
   100-μl aliquots, and then flash-frozen in liquid nitrogen.

   For nuclei fixed with paraformaldehyde, library construction was
   similar to the sci-RNA-seq3 method from Cao et al. ([245]19) with minor
   modifications including the substitution of Quick Ligase (New England
   Biolabs) for 10 min at 25°C for the second index step, instead of T4
   DNA ligase (NEB) for 180 min at 16°C. For tagmentation, we used N7
   adaptor-loaded Tn5 from QB3 MacroLab at the University of California
   Berkeley in tagmentation buffer (2× TD) as previously described in
   Corces et al. ([246]98): 20 mM tris-HCl (pH 7.5), 10 mM MgCl[2], and
   20% (vol/vol) dimethylformamide. Libraries were sequenced on a NextSeq
   or NovaSeq platform (Illumina) (read 1: 34 cycles, read 2: 100 cycles,
   index 1: 10 cycles, index 2: 10 cycles).

   For the [dithiobis (succinimidyl propionate)] (DSP)/methanol nuclei
   isolations and library construction based on Martin et al. ([247]21),
   we used hypotonic lysis buffer solution B (with bovine serum albumin)
   for small volume tiny sci-RNA-seq3 nuclei isolation methods. For
   sci-RNA-seq3 library construction, we loaded ~20,000 nuclei per index 1
   reverse transcriptase (RT) well in a 384 RT-well experiment with mouse
   and human brain added as separate QC nuclei and nuclei from cell lines
   human embryonic kidney (HEK) 293T (RRID:CVCL_0063) and NIH/3 T3
   (RRID:CVCL_0594) combined as barnyard controls per RT plate. Nuclei
   from all RT plates were pooled and redistributed to ligation plates for
   the second index as previously published; after the addition of the
   second index, nuclei were again repooled for their final distribution
   of 4000 nuclei per well before second strand synthesis, protease
   digestion, tagmentation, and polymerase chain reaction all on this
   final third index plate.

snRNA-seq preprocessing

   snRNA-seq sequencing reads were processed into a gene-by-nucleus
   expression matrix of UMI counts following the methods described by Cao
   et al. ([248]19). We used largely an identical pipeline which, briefly,
   (i) converts base calls to fastq files with bcl2fastq/v.2.20
   (RRID:SCR_015058) (Illumina), (ii) removes adapter sequences using Trim
   Galore/v.0.6.7 (RRID:SCR_011847) ([249]99), (iii) aligns trimmed reads
   to a reference genome with STAR/v.2.7.6 (RRID:SCR_004463) ([250]100),
   (iv) extracts mapped reads, (v) removes duplicates, and (vi) generates
   UMI counts for exonic and intronic regions of each gene, tabulated
   according to the unique three-level barcode design in sci-RNA-seq3. We
   used the rhesus macaque reference genome (Mmul_10) ([251]101) and
   annotation, obtained from Ensembl (version 101) (RRID:SCR_002344). We
   extended the 3′ untranslated region annotations of genes and
   transcripts by 500 bp to avoid misclassifying genic reads as
   intergenic. The remainder of our pipeline followed the procedures
   described by Cao et al. ([252]19). After generating the count matrix,
   we removed all nuclei with UMI counts < 100.

   For each sample, we imported gene-by-nucleus count matrices into the
   AnnData/v.0.8.0 (RRID:SCR_018209) ([253]102) framework and then ran
   Scrublet/v.0.2.3 (RRID:SCR_018098) ([254]103) (expected_doublet_rate =
   0.05) to calculate doublet scores. We marked nuclei as doublets if they
   had Scrublet doublet scores > 0.20. For each sample, we additionally
   marked nuclei as doublets using per-sample thresholds determined by
   Scrublet and adjusted by eye as necessary to separate bimodal peaks
   visualized on the Scrublet doublet score histogram (fig. S1).

   To further identify potential doublet nuclei, we used an iterative
   clustering strategy ([255]104) implemented with Scanpy/v.1.9.1
   (RRID:SCR_018139) ([256]105). First, we combined all nuclei into a
   single AnnData object and filtered nuclei to those with UMI ≥ 100,
   number of expressed genes < 2500, and a percentage of reads mapping to
   the mitochondrial genome < 5%. We then removed all non-autosomal genes,
   genes located on unplaced scaffolds, and unexpressed genes. Next, we
   normalized the data to the total UMI per nucleus, logarithmized the
   data, and subsetted the data to the 10,000 most variable genes. For
   each cell, we regressed out total UMI counts per nucleus and then
   mean-centered and scaled the data. The dimensionality of the data was
   then reduced by principal components analysis (PCA) (50 components). To
   further reduce the dimensionality, we ran a UMAP (using
   umap-learn/v.​​0.5.2) (RRID:SCR_018217) analysis ([257]106) with
   BBKNN/v.1.5.1 (RRID:SCR_022807) ([258]107) to simultaneously correct
   for batch differences. For the BBKNN integration, we set
   neighbors_within_batch = 10 (given three batches, tantamount to UMAP
   n_neighbors = 30), used the cosine distance metric, and used the
   PyNNDescent/v.0.5.6 algorithm (RRID:SCR_022806) ([259]108). We then ran
   UMAP using the settings min_dist = 0, spread = 1.0, and n_components =
   10 to facilitate clustering
   ([260]https://umap-learn.readthedocs.io/en/latest/clustering.html). For
   data visualization only (not clustering), we ran a similar BBKNN/UMAP
   pipeline with neighbors_within_batch = 5 (for three batches, tantamount
   to UMAP n_neighbors = 15), min_dist = 0.25, spread = 1.0, and
   n_components = 2. To cluster the data, we exported and imported the
   10-dimensional UMAP matrix into Monocle3/v.1.2.9 (RRID:SCR_018685)
   ([261]19) in R/v.4.0.2 (RRID:SCR_001905) ([262]109) and then
   implemented the Leiden-clustering workflow in Monocle3 with a
   relatively high-resolution setting (resolution = 1 × 10^−4). For each
   cluster, we then calculated the mean Scrublet doublet score and marked
   all clusters with a mean Scrublet doublet score > 0.15 as doublet
   clusters (fig. S1).

   After identifying doublets as described above, we removed all marked
   doublets and repeated the normalization, dimensionality reduction, and
   clustering procedures almost exactly as described above, with the only
   difference being a coarser cluster resolution setting in Monocle3
   (resolution = 1 × 10​​^−5). We confirmed adequate removal of doublet
   cells by observing the clean separation of distinct cell types and the
   absence of clusters expressing obviously ambiguous marker gene profiles
   (fig. S1).

Removal of sci-RNA-seq cell contamination

   During the course of cell type identification (see the following
   section), we observed the presence of two distinct clusters of cells
   (fig. S2A) with expression profiles resembling embryonic progenitors
   (markers genes, unknown cluster 1: ASPM, CENPE, CENPF, MKI67; unknown
   cluster 2: COL1A1, COL1A2, FN1, VIM), an unusual finding in adult
   primate brain samples. Because these were present in relatively large
   proportions in some samples (~25%)—but at low levels overall (2.2%)—and
   because our sci-RNA-seq experiments included control samples of
   exogenous (i.e., non-macaque brain) origin (specifically, a fetal mouse
   brain positive control and a “barnyard” sample consisting of mixed
   human HEK293T and mouse NIH/3 T3 cells), we tested for the presence of
   contaminating nuclei of exogenous origin. We identified and removed
   contaminating cells as follows.

   Because the only non-macaque samples included in all experiments were
   the control samples of either human or mouse origin, we used
   BBSplit/v.38.38 (RRID:SCR_016965) ([263]110) to assign reads to the
   macaque, human, or mouse genomes. BBSplit is a competitive aligner that
   maps to several references simultaneously, assigning reads to the