Abstract

   The tangential expansion of the human cerebral cortex, indexed by its
   surface area (SA), occurs mainly during prenatal and early postnatal
   periods, and is influenced by genetic factors. Here we investigate the
   role of rare copy number variants (CNVs) in shaping SA, and the
   underlying mechanisms, by aggregating CNVs across the genome in
   community-based cohorts (N = 39,015). We reveal that genome-wide CNV
   deletions and duplications are associated with smaller SA. Subsequent
   analyses with gene expression in fetal cortex suggest that CNVs
   influence SA by interrupting the proliferation of neural progenitor
   cells during fetal development. Notably, the deletion of genes with
   strong (but not weak) coexpression with neural progenitor genes is
   associated with smaller SA. Follow up analyses reveal similar
   mechanisms at play in three clinical CNVs, 1q21.1, 16p11.2 and 22q11.2.
   Together, this study of rare CNVs expands our knowledge about genetic
   architecture of human cerebral cortex.

   Subject terms: Nervous system, Development of the nervous system
     __________________________________________________________________

   Variation in cortical surface area in adults can reflect developmental
   events occurring during prenatal and early postnatal periods. Here, the
   authors find rare copy number variants associated with cortical surface
   area, which are also found to disrupt neural progenitor proliferation
   during fetal development.

Introduction

   The human cerebral cortex, essential for multiple perceptual, motor,
   and cognitive processes, is a highly folded sheath of tissue with about
   1800 cm^2 of surface area^[44]1. Inter-individual variations in the
   cortical surface area (SA) contribute to the variations in cognitive
   abilities and vulnerability to mental illness^[45]2–[46]4. The cerebral
   cortex expands dramatically during prenatal periods and the first two
   years of life, with subsequent age-related increases in SA being
   limited^[47]1,[48]5–[49]7. Thus, individual variations in SA observed
   in adulthood reflect mainly developmental events occurring during the
   prenatal and early postnatal periods. The cortical expansion during the
   prenatal period has been of great interest for decades. During fetal
   development, radial glia can self-renew and generate intermediate
   progenitor cells (IPCs), which can self-amplify prior to generating
   neurons^[50]8–[51]11. Based on a large body of work carried out in
   non-human primates, the radial unit hypothesis proposed that the
   proliferation of neural progenitor cells plays a central role in the
   tangential growth of cerebral cortex^[52]12. Increasing the
   proliferation of progenitor cells experimentally in the mouse embryo
   results in more neurons and a larger SA^[53]13.

   Tangential expansion of the cerebral cortex is largely influenced by
   genetic variations, with about 90% heritability of SA as estimated in
   twin studies^[54]14. Genome-wide association studies (GWASs) have
   identified hundreds of loci related to SA of the adult cerebral cortex.
   These common genetic variants are enriched in regulatory activities in
   neural progenitor cells, which is consistent with the radial unit
   hypothesis^[55]15–[56]17. Yet, most of these genetic variants reside in
   non-coding regions, posing a challenge regarding whether and how these
   variants influence gene expression and biological processes. In
   contrast, rare copy number variants (CNVs) with protein-coding gene(s)
   fully deleted or duplicated lead to substantial decrease and increase,
   respectively, in gene expression in the brain^[57]18. Thus,
   investigating the effects of CNVs on SA can provide valuable insights
   into the intricate relationship between gene expression and cortical
   expansion. Previous research on the association between CNVs and SA has
   focused on a few specific genomic loci, such as 1q21.1, 16p11.2, and
   22q11.2, which confer high risk for psychiatric disorders such as
   autism and schizophrenia^[58]19–[59]23. These studies have reported
   large effects of these CNVs on SA but identifying the contributing
   genes within these multigenic variants remains challenging. Beyond
   these specific CNVs, the majority of CNVs are very rare (<1/5000), and,
   as such, their individual effects on the brain are difficult to assess
   with currently available sample sizes. A common strategy to overcome
   this limitation has been to aggregate rare CNVs across the genome based
   on gene content and characteristics (e.g., intolerance to
   loss-of-function) to test association with phenotypes such as cognitive
   abilities and autism^[60]24–[61]27.

   In this work, we investigated the relationship between CNVs and SA by
   pooling data from three community-based cohorts (IMAGEN, Saguenay Youth
   Study, and the UK biobank). By aggregating - across the genome - all
   CNVs (≥ 50 kilobases) based on genes’ intolerance to loss-of-function,
   we find that genome-wide deletions and duplications are associated with
   smaller SA. By utilizing virtual ontogeny and expression Quantitative
   Trait Loci (eQTLs) from the fetal cortex, we then test the explanatory
   power of the radial unit hypothesis vis-à-vis the development of the
   human cerebral cortex. Lastly, we apply similar approaches to gain
   insights into the developmental mechanisms underlying the effects of
   three specific CNVs related to neurodevelopmental disorders, namely
   1q21.1, 16p11.2, and 22q11.2, on cortical expansion. Our findings
   suggest that CNVs influence SA by interrupting the expression of genes
   involved in neural proliferation in the fetal period.

Results

CNVs and surface area

   Across the genome, all CNVs that were ≥ 50 kilobases and fully
   encompassed at least one protein-coding gene were included in this
   study (“Methods”). To estimate the association between genome-wide CNVs
   and cortical SA, we conducted mega-analyses using data from three
   community-based cohorts: the IMAGEN study, Saguenay Youth Study (SYS),
   and the UK Biobank (UKBB), including a total of 39,015 individuals with
   both magnetic resonance imaging (MRI) of their brains and genome-wide
   assessment of CNVs (Fig. [62]1A). In this dataset, we identified 2767
   unique protein-coding CNVs, of which 93% were present in fewer than
   five individuals (Supplementary Data [63]1). Out of the 39,015
   individuals, 2733 (1423 females) and 3859 (2011 females) individuals
   carried, respectively, at least one deletion or one duplication fully
   encompassing one or more protein-coding genes. Thus, there were 6309
   CNV carriers and 32,706 CNV non-carriers. Across the genome, 5822 genes
   were either deleted (2084 genes) and/or duplicated (4836 genes) at
   least once (Fig. [64]1A and Supplementary Data [65]2).

Fig. 1. Genome-wide CNVs and cortical expansion.

   [66]Fig. 1
   [67]Open in a new tab

   A Demographic of each cohort and the number of unique protein-coding
   genes being deleted and duplicated in the pooled dataset. B Workflow of
   the virtual ontogeny. We first estimated the effect sizes of
   genome-wide CNVs on cortical SA of 11 regions using data from
   community-based cohorts. Subsequently, “virtual ontogeny” analyses were
   conducted to investigate the underlying neurodevelopmental mechanisms
   by testing the relationships between the CNV-SA effect size profiles
   and the expression of cell-specific genes in the mid-gestational fetal
   cortex. C Estimated effect sizes of the burden of CNVs
   deletion/duplication on normalized SA (Z-score) in community-based
   cohorts (N = 39,015). We used the sum of 1/LOEUF scores of genes within
   CNVs for each individual to quantify the burden of CNVs. The centered
   dots represent the estimated effect size for each region. The error
   bars represent the standard error of the estimated effect size for each
   region. D Results of the virtual ontogeny. The dashed vertical line
   represents the mean biweight midcorrelation within each cell type. The
   gray box around zero represents 95% confidence intervals of the null
   distribution, which was derived by a permutation approach. The asterisk
   denotes p < 0.05 (FDR). DFC, dorsal frontal cortex; VFC, ventral
   frontal cortex; OFC, orbitofrontal cortex; MFC, medial frontal cortex;
   ITC, inferior temporal cortex; STC, superior temporal cortex; IPC,
   inferior parietal cortex; M1C, primary motor cortex; S1C, primary
   somatosensory cortex; A1C, primary auditory cortex; V1C, primary visual
   cortex. SYS Saguenay Youth Study, UKBB United Kingdom BioBank.

   Given the rarity of most CNVs with protein-coding genes, testing their
   individual effects on SA was not feasible with the currently available
   sample size. Instead, recent studies have aggregated rare CNVs based on
   general characteristics of involved genes, such as intolerance to
   loss-of-function, to explore the association of CNV burden with
   cognitive abilities and autism risk^[68]24–[69]27. Here, we used a
   well-established measure of genes’ intolerance to loss-of-function,
   namely, the inverse loss-of-function observed/expected upper-bound
   fraction (1/LOEUF), to estimate the total CNV burden for each
   individual^[70]28. Thus, a higher burden score indicates more deletions
   or duplications of “intolerant” genes in an individual (Supplementary
   Fig. [71]1). We then quantified the effect sizes of the CNV burden on
   normalized SA (Z-scored across participants) of 11 cortical regions
   using linear mixed-effect models. These 11 regions of interest were
   selected based on the availability of gene-expression data obtained
   during the prenatal period, which were used in subsequent analysis
   (Supplementary Fig. [72]2). In addition, for each CNV with 20 or more
   carriers (35 CNVs), we provide the effect sizes on SA in Supplementary
   Fig. [73]3. We note that these recurrent CNVs have small effects on
   phenotypes, likely because they do not include genes that are
   intolerant to loss-of-function^[74]25,[75]26,[76]29. Estimating such
   small effects reliably would require substantially larger sample sizes.

   We found that both CNV deletions and duplications were associated with
   a smaller SA across the 11 regions (Fig. [77]1C). On average, across
   the 11 regions, a one-point increase in the burden score for deletions
   and duplications was associated, respectively, with SA decreases of
   0.0065 SD (p = 1.75E-05) and 0.0022 SD (p = 5.81E-04; Supplementary
   Data [78]3). Similar effect sizes were observed when utilizing the
   total number of genes deleted or duplicated in an individual as a
   simple measure of the CNV burden (Supplementary Fig. [79]4). The
   greater detrimental effects of deletions compared with duplications on
   SA aligns with previous research on cognitive abilities, where the
   effect of deletions is three times that of duplications^[80]24. Such
   difference may be attributed to larger changes in gene expression for
   CNV deletions compared with duplications^[81]18.

Virtual ontogeny for CNV-SA relationships

   To investigate the neurodevelopmental mechanisms underlying the
   CNV-related variations in SA, we used a previously established method
   named “virtual ontogeny” (Fig. [82]1B). This method provides a
   cellular-level interpretation of the MRI-based phenotype by testing
   spatial correlations between inter-regional profiles of CNV effects on
   SA and cell-specific gene expressions in mid-gestation across the 11
   cortical regions^[83]3. More generally, similar approaches that test
   spatial correlations between MRI matrices and molecular maps across the
   cerebral cortex are used increasingly to explore biological
   underpinnings of various MRI-based findings^[84]30. In virtual
   ontogeny, based on a single-cell RNA-sequencing study^[85]31, we
   selected 200 gene markers for each of 8 cell types present in the fetal
   cerebral cortex (6–22 gestational weeks): radial glia, intermediate
   progenitor cells (IPC), microglia, oligodendrocyte progenitor cells
   (OPC), excitatory and inhibitory neurons, mural cells and endothelial
   cells. The expression of these cell-specific genes in each of the 11
   regions during mid-gestation (12–22 postconceptional weeks) was
   obtained from the PsychENCODE bulk RNA-sequencing dataset^[86]32. We
   then tested the spatial correlation between interregional profiles of
   CNV-SA effect size and the expression profile of genes for each of the
   eight cell types. To simplify the interpretation of virtual ontogeny,
   we used absolute values of the effect sizes such that higher values
   indicate larger CNV-related variations in SA.

   The virtual ontogeny revealed that cortical regions with higher
   expression of genes specific to radial glia, IPC, and microglia during
   mid-gestation, including OFC, MFC, and V1C, showed stronger effects of
   genomic deletions on SA. On the other hand, regions with higher
   expression of genes specific to the majority of neuronal and glial cell
   types showed weaker effects of CNV deletions (Fig. [87]1D,
   Supplementary Fig. [88]5, and Supplementary Data [89]4). For
   duplications, the CNV-SA effect size profile showed no spatial
   correlations with progenitor cells (radial glia, IPCs), and opposite
   correlations with the differentiated cell types, as compared with
   deletions. These results were validated by four sensitivity analyses:
   (1) 100 and 300 gene markers specific for each of the eight cell types,
   (2) bootstrapped estimation for the distribution of the mean
   correlation, (3) gene-set enrichment analysis (Supplementary Fig. [90]6
   and Supplementary Datas [91]4–[92]6), and (4) gene markers specific to
   16 cell types derived from another single-cell transcriptomic dataset
   for mid-gestational cerebral cortex^[93]33 (Supplementary Fig. [94]7).”

Proximal and distal genes in network of progenitor genes

   To investigate which of the deleted/duplicated genes contribute to the
   association between genome-wide CNV burden and SA, we were inspired by
   the omnigenic model, which proposes that a large number of genes
   expressed in trait-relevant tissues contribute to the heritability of
   the trait through regulatory networks of core genes^[95]34.
   Accordingly, we hypothesized that genes that are highly co-expressed
   with neural progenitor genes (core) in the gene network have a stronger
   influence on cortical expansion than those distal to this core set of
   genes. To test this hypothesis, we divided all deleted and duplicated
   genes in our dataset into two groups based on their levels of spatial
   coexpression with neural progenitor genes (radial glia and IPC genes)
   across the 11 regions in the fetal cerebral cortex. Genes with high
   coexpression were labeled “proximal”, while those with low coexpression
   were labeled “distal” (Fig. [96]2A). To ensure a balanced sample size
   for CNV carriers with proximal/distal genes affected, we used the
   median coexpression value as a threshold (Supplementary Fig. [97]9,
   Supplementary Data [98]2 and Supplementary Fig. [99]11 for results
   obtained with a sliding-window approach). We then tested the
   relationship between the number of deleted/duplicated proximal and
   distal genes and SA of the 11 regions, as well as the total SA, while
   controlling for the effects of the other group of genes and genes
   without expression data.

Fig. 2. Effects of genes on cortical surface area in relation to coexpression
with neural progenitor genes.

   [100]Fig. 2
   [101]Open in a new tab

   A Illustration of defining proximal and distal genes. The
   proximal/distal categorization was based on coexpression levels with
   neural progenitor genes across the 11 cortical regions. “Proximal”
   denotes genes with high coexpression, while “distal” refers to genes
   with low coexpression. B The effect of proximal/distal gene deletions
   on SA (N = 39,015). Deletions of proximal genes, but not distal genes,
   correlated with smaller SA. The centered symbols represent the
   estimated effect size for each region. Error bars represent the
   standard error of the estimated effect sizes for each region. C
   Physical protein-protein interactions among the proximal genes. Node
   size corresponds to the number of interacting partners within the
   network. The names of genes with more than 19 interactions are shown. D
   Calculation of eQTL score for each proximal gene. For participants from
   the UKBB, we derived the eQTL score of a gene based on the gene’s eQTLs
   identified in the fetal cerebral cortex. E 54 proximal genes showed
   nominally significant correlations between their eQTL scores and SA. F
   Stronger relationships between eQTL scores and SA were found in the
   multimodal compared with the unimodal cortical regions. G The
   expression of neural progenitor genes in the multimodal and the
   unimodal cortices. There are 200 genes for radial glia and IPC,
   respectively. Centerline, median; box limits, upper and lower
   quartiles; whiskers, 1.5x interquartile range; points, outliers.

   We found that, on average, the deletion of one proximal gene was
   associated with a 0.0512 SD reduction in total SA (equal to about
   8 cm^2 in our dataset, p = 1.34E-03), and a mean 0.0385 SD reduction
   across the 11 regions (p = 3.42E-06, Fig. [102]2B). Deletions of the
   proximal genes had larger effects in “multimodal” regions, including
   DFC, VFC, OFC, MFC, ITC, STC and IPC which subserve complex perceptual
   and cognitive processes, than in “unimodal” regions (including M1C,
   S1C, A1C, V1C; Fig. [103]2B and Supplementary Data [104]7). Deletions
   of distal genes had no effect on SA (total SA: beta = 0.0027,
   p = 0.8695; across 11 regions: mean beta = 0.0027, p = 0.593; distal
   vs. proximal genes: p = 3.5E-04. Supplementary Data [105]7). An
   exception to this pattern was observed for V1C, where lower SA was
   associated with deletions of distal genes (beta = − 0.041, p = 0.044)
   but not of proximal genes (beta = − 0.025, p = 0.201). Note that V1C
   was the only cortical region with a significant effect of distal genes
   on SA (Fig. [106]2B).

   Among the 579 deleted proximal genes, 448 had at least one physical
   protein-protein interaction with other proximal genes; the most
   interacting genes were involved in DNA replication and cell cycle, such
   as H4C1 and CENPA (Fig. [107]2C). Pathway enrichment analysis based on
   these interacting proximal genes identified a number of biological
   processes, with the top pathways involving DNA replication and repair
   (q value = 5.68E-06), metabolism of co-factors and vitamins (q
   value = 9.39E-06), and cell growth and death (q value = 1.12E-05,
   Supplementary Dataset [108]8). No pathway enrichment was found for the
   479 interacting distal genes (Supplementary Fig. [109]10), suggesting a
   heterogeneous nature of this group of genes. We found no
   duplication-related effects on SA for either proximal or distal genes
   (Proximal: total SA: beta = − 0.0042, p = 0.6406; across the 11
   regions: mean beta = − 0.0049, p = 0.074. Distal: total SA:
   beta = − 0.0046, p = 0.5799; across the 11 regions: mean
   beta = − 0.0004, p = 0.778; vs. proximal genes: p = 0.321. Figure
   [110]2B and Supplementary Data [111]7). This is not surprising given
   that our virtual ontogeny analysis indicated no spatial correlation
   between the duplication-related variations in SA and expression of
   genes specific to proliferative cell types.

Expression quantitative trait loci of genes and surface area

   In CNV studies, elucidating the impact of individual genes within
   multigenic variants poses a challenge. Next, we aimed to answer two
   questions: (1) whether the prenatal expression of individual genes
   belonging to the above proximal group predicts cortical SA in the adult
   brain; (2) as motivated by the above results (Fig. [112]2B), whether
   the expression of these genes has stronger effects on the multimodal
   regions compared with the unimodal regions. To do so, in 30,083 CNV
   non-carriers from the UKBB, we computed an eQTL score as a proxy for
   each proximal gene’s expression in the fetal cerebral cortex
   (Fig. [113]2D). We then assessed the correlation between each gene’s
   eQTL score and SA across the 11 regions using a mixed-effect model.
   Among 543 proximal genes with eQTL scores, 54 showed nominally
   significant correlations between their eQTL scores and SA; 4/54 genes
   (HADHB, PRRT2, TMEM44, CORO1A) survived FDR correction (Fig. [114]2E
   and Supplementary Data [115]9). In a replication sample of 16,212
   individuals from UKBB, two of the four genes showed significant
   correlation, in the same direction, between their eQTL scores and SA
   (HADHB: beta = 0.012, p = 0.007; CORO1A: beta = 0.01, p = 0.022; PRRT2:
   beta = 0.006, p = 0.154; TMEM44: beta = 0.003, p = 0.508). In addition,
   the effect sizes (betas) of the 54 genes in the replication dataset
   were correlated with those in the discovery dataset (r = 0.454,
   p = 5.64E-04; Supplementary Fig. [116]13).

   To answer the second question, we compared the effect sizes of the
   relationships between the 54 genes’ eQTL scores and SA of the
   multimodal and of the unimodal regions. The correlations between eQTL
   scores and SA were stronger in multimodal (vs. unimodal) regions
   (Fig. [117]2F. Paired t test: t = − 2.23, df = 53, p = 0.030).
   Furthermore, to explore the underlying reasons for such
   unimodal-multimodal differences, we compared the expression of neural
   progenitor genes in the multimodal and the unimodal regions during
   mid-gestation. During this developmental period, the multimodal regions
   showed higher expression of genes specific to radial glia and IPC
   compared with the unimodal regions (Fig. [118]2G. radial g
   p = 2.90E-20; IPC: p = 3.26E-05). But note that we did not replicate
   the difference in the eQTL-SA relationships between unimodal and
   multimodal regions in the replication sample (t = 0.367, p = 0.715).

Specific CNVs and surface area

   Previous research has reported large effects on SA of specific CNVs,
   including 1q21.1, 16p11.2, and 22q11.2, known to increase the risk of
   neurodevelopmental disorders^[119]21. Yet the underlying mechanisms and
   contributing genes of these multigenic variants remain elusive. Next,
   we investigated neurodevelopmental mechanisms underlying these
   CNV-related variations in SA using the same methodology as for
   genome-wide CNVs. First, we estimated the effect size of specific CNVs,
   namely 1q21.1, 16p11.2, 22q11.2, on SA by comparing normalized SA of
   the same 11 regions between CNV carriers (deletion or duplication) and
   non-carriers from clinical and community cohorts (Supplementary
   Data [120]10). Compared with the non-carriers, deletions of 1q21.1 and
   22q11.2 had smaller SA across the 11 regions (1q21.1: − 1.04 SD,
   p = 5.23E-07; 22q11.2: − 0.57 SD, p = 1.40E-03), while duplications of
   1q21.1 and 22q11.2 showed larger SA (1q21.1: 0.45 SD, p = 5.14E-05;
   22q11.2: 0.18 SD, p = 8.26E-03. Figure [121]3A and Supplementary
   Data [122]3). In contrast, for 16p11.2, deletions displayed larger SA
   (vs. non-carriers: 0.39 SD, p = 2.59E-03), while duplications showed
   smaller SA (− 0.54 SD, p = 8.77E-05).

Fig. 3. Neurodevelopmental CNVs and cortical expansion.

   [123]Fig. 3
   [124]Open in a new tab

   (A) Case-control differences in normalized SA between CNV
   deletion/duplication at 1q21.1 (N deletion/duplication = 36/30),
   16p11.2 (N deletion/duplication = 82/75), and 22q11.2 (N
   deletion/duplication = 69/19) and the control (CNV noncarriers,
   N = 347). The error bars represent the standard error of estimated
   group differences for each region. B Virtual ontogeny through the lens
   of the correlation between CNV effects on SA and expression of
   cell-specific genes. The asterisk denotes a significant spatial
   correlation between the group differences in SA and the expression of
   cell-specific genes (p < 0.05 after FDR). C Coexpression scores between
   genes within the three CNVs and genes specific to radial glia and IPC.
   Coexpression scores represent the mean spatial correlations between
   prenatal expression of a CNV gene and 200 cell-specific genes and thus,
   range from − 1 to 1. The color and size of squares indicate the
   magnitude of coexpression. D Effect size of correlations between CNV
   genes’ eQTL score and SA of the 11 regions.

   Second, using virtual ontogeny, we found that deletions of 1q21.1 and
   22q11.2 and duplications of 16p11.2 and 22q11.2 showed results similar
   to genome-wide CNV deletion: regions with higher expression of
   progenitor genes (radial glia and IPC) in midgestation showed stronger
   effects of CNVs on SA (Fig. [125]3B, Supplementary Fig. [126]6,
   Supplementary Fig. [127]8 and Supplementary Datas [128]4–[129]6). These
   findings suggest that the proliferation and differentiation of neural
   progenitor cells may be a common cellular mechanism underlying the
   CNV-SA relationships.

   Third, we reasoned that genes coexpressed with the radial glia and IPC
   genes in the fetal cortex could be the main contributors to the effect
   of 1q21.1, 16p11.2, and 22q11.2 CNVs on the cortical expansion. Among
   the 65 genes (out of the 88 genes within the three genomic loci) with
   available prenatal expression data, 50 genes were spatially coexpressed
   with both radial glia and IPC genes across the 11 regions during the
   midgestation (6/7 genes in 1q21.1, 19/24 genes in 16p.11.2, and 25/34
   genes in 22q11.2. Fig. [130]3C). We subsequently tested whether the
   expression of these 50 genes in the fetal cortex predicts SA of the
   adult brain by testing the correlation between each gene’s eQTL score
   and SA of the 11 regions in 30,083 CNV non-carriers from the UKBB.
   Among the 47 genes with eQTL scores, 6 showed significant correlations
   between their eQTL scores and the SA (YPEL3, PRRT2, MVP, SCARF2,
   SNAP29, THAP7. Figure [131]3D and Supplementary Data [132]11).

Discussion

   Leveraging current knowledge of developmental processes underlying
   cortical development, this genome-wide association study of rare CNVs
   provides novel insights into the genetic architecture of SA of the
   human cerebral cortex. We demonstrated that, in general populations,
   individuals with a higher number of deleted protein-coding genes have a
   smaller cortical SA. This effect was stronger (1) in cortical regions
   with higher mid-gestational expression of neural progenitor genes, and
   (2) for genes that were strongly co-expressed with neural progenitor
   genes. These findings were extended to common variants affecting
   transcription, showing that eQTLs of some of these co-expressed genes
   were associated with SA. Enrichment analyses of these co-expressed
   genes pointed to diverse pathways involved in cell proliferation,
   growth, and death, such as DNA replication/repair and metabolism of
   co-factors and vitamins. Altogether, our findings suggest a model
   wherein CNVs interfere with the tangential expansion of the human
   cerebral cortex by altering the expression of genes involved in neural
   proliferation in the fetal period.

   Our findings align well with the radial unit hypothesis, which was
   formulated based on experimental work on non-human primates^[133]12. In
   the case of the human cerebral cortex, we showed that cortical regions
   with greater expression of proliferative cells (radial glia and IPCs)
   in the fetal cerebral cortex were more affected by CNVs. We also
   revealed that deletions of genes that had strong (rather than weak)
   coexpression with neural progenitor genes were associated with smaller
   SA. These effects were stronger in multimodal vs. unimodal cortical
   regions. The latter finding was corroborated by utilizing eQTLs in a
   large sample of CNV non-carriers, showing that the associations between
   eQTLs and SA were again stronger in the multimodal compared
   with unimodal regions.

   In addition, as revealed by the enrichment analysis of the proximal
   genes within a protein-protein interactions network, these genes appear
   to be involved in crucial cell proliferation pathways, including
   regulation of the cell cycle, metabolism of co-factors, and vitamins
   involved in DNA synthesis and repairment. For example, the E2F-enabled
   inhibition of the pre-replication complex formation pathway is involved
   in regulating the initiation of DNA synthesis in the early cell cycle
   (G1/S-phase)^[134]35. The formyltetrahydrofolate biosynthesis pathway
   is crucial for the synthesis of tetrahydrofolic acid which acts as a
   cofactor essential for the synthesis or anabolism of amino acids and
   nucleic acids^[135]36. The Proliferating Cell Nuclear Antigen
   (PCNA)-Dependent Long Patch Base Excision Repair pathway is a key
   mechanism for removing damage-containing strands of DNA^[136]37.These
   results are consistent with previous reports indicating that common
   genetic variants linked to adult SA were enriched in regulatory
   activities in neural progenitor cells during fetal
   development^[137]15,[138]17. According to the radial unit hypothesis
   and related experimental work, the cortical surface area of a given
   region depends on the size of the neural progenitor population^[139]13.
   The self-renewing and amplification of radial glia and IPCs are
   critical for determining the size of the neuronal population: two
   radial glial cells, which can self-renew and produce self-amplifying
   IPCs, could generate more than 80 neurons following 8 rounds of
   cellular cycles^[140]11. Thus, subtle deviations or variations in the
   molecular processes underlying the division of progenitor cells may
   have a profound impact on the tangential expansion of the cerebral
   cortex, especially in the multimodal regions, which have high
   expression of neural progenitor genes during midgestation and prolonged
   developmental trajectory^[141]38.

   Furthermore, our findings suggest that genes not specific to neural
   progenitor cells may also influence cortical expansion, likely via
   intricate interactions within various biological networks. For example,
   most CNVs’ effects on SA had associations with genes specific to
   endothelial and mural cells, components of developing cortical blood
   vessels, which are key regulators of neuronal proliferation^[142]39. In
   addition, microglia, an integral component of cortical proliferative
   zones, have been found to regulate the pool size of neural progenitor
   cells in the fetal cortex: deactivating or eliminating microglia in the
   fetal cerebral cortex increased the number of neural precursor
   cells^[143]40,[144]41. In addition to the specific pathways described
   above, genes proximal to the neural progenitor genes were associated
   with broad biological pathways involving, for example, macromolecule
   metabolism and transport. The genes identified as key contributors to
   the effects of the three specific CNVs on SA are also involved in
   diverse biological processes with potential direct or indirect
   influences on neurogenesis, such as the regulation of DNA transcription
   (THAP7) cellular senescence (YPEL3), and cell adhesion (SCARF2).

   The above observations align with the concept of omnigenic model, which
   suggests that complex traits are largely influenced by genes without
   direct links to the trait^[145]34. In other words, since the normal
   tangential growth of the cerebral cortex requires coordination among
   various cellular processes implemented by other systems, genetic
   variants of non-neuronal processes can have downstream effects on
   cortical expansion. For instance, perturbations in vascular development
   can lead to abnormal processes during cortical expansion, including
   premature neuronal differentiation and mispositioning growth of neural
   progenitor cells^[146]39,[147]42–[148]44.

   It is important to note that the generalizability and interpretation of
   these findings are subject to certain limitations. Our study has
   focused on 11 cortical regions with available gene-expression data from
   the prenatal period. Although this limits the generalizability of the
   findings throughout the cerebral cortex, note that the 11 regions
   include a number of unimodal and multi-modal regions in all four
   cerebral lobes. Furthermore, we also included results with regard to
   the total surface area. Our main analyses were conducted using pooled
   data from adults (UKBB and SYS parents) and adolescents (IMAGEN, SYS).
   While the majority of the data were derived from adults, we believe
   that age at MRI has, at most, a minor influence on generalizing our
   findings. This is because the most significant cortical expansion
   occurs during the prenatal and early postnatal periods, reaching ~ 83%
   of adult size by 2 years of age^[149]45,[150]46. Moreover, the eQTL
   scores used in our analyses were calculated based on the effects of
   genotypes on gene expression during fetal development. Lastly, the
   current study focused on testing the explanatory power of the radial
   unit hypothesis in the context of the CNVs, leveraging transcriptomic
   data from the mid-gestational fetal cortex. We acknowledge that other
   biological process, such as dendritic arborization and myelination
   during the first two years of life, may also contribute to the
   CNV-surface area relationship.

   Taken together, this genome-wide association study of rare CNVs
   revealed a group of genes that can influence cortical expansion,
   providing mechanistic insights into the early development of the human
   cerebral cortex, and the potential impact of rare gene
   deletions/duplications on this process. Our findings highlight that
   deficits in neural progenitor proliferation could serve as a
   parsimonious theoretical framework to unify hypotheses concerning the
   molecular mechanisms underlying CNV-related brain abnormalities. We
   believe that this knowledge advances our understanding of how rare
   genetic variants may affect cognition and mental health.

Methods

Participants

   We included data from three community-based cohorts: the IMAGEN study
   (n = 1661), which recruited 2090 adolescents in four European
   countries; the Saguenay Youth Study (SYS, n = 1493), which included
   both children and parents from 486 families in Quebec, Canada, and the
   UK Biobank (UKBB, n = 35,861). Carriers of the three specific CNVs,
   namely 1q21.1, 16p11.2, and 22q11.2, were ascertained in several
   clinical cohorts and UKBB (Supplementary Data [151]10). These CNVs were
   selected because they are (1) the most frequently identified CNVs in
   the clinic and reported in our clinical cohorts, (2) known for their
   risk of neurodevelopmental disorders and detrimental effects on
   cognition, (3) have large effects on surface area^[152]21, and (4)
   relatively prevalent in the population^[153]19,[154]22. As described in
   previous study^[155]21,[156]47, carriers of these specific CNVs were
   selected based on following breakpoints according to the reference
   genome GRCh37/hg19: 16p11.2 proximal (BP4-5, 29.6-30.2MB), 1q21.1
   distal (Class I, 146.4-147.5MB & II, 145.3-147.5MB), 22q11.2 proximal
   (BPA-D, 18.8-21.7MB). CNV carriers from clinical cohorts were referred
   to the genetic clinic either because of the presence of symptoms or
   because they were relatives of these symptomatic individuals. Control
   individuals from clinical cohorts do not carry any CNV at these genomic
   loci. Control individuals from UKBB were non-carriers matched for age
   and sex with CNV carriers from UKBB. Each cohort received ethical
   approvals from Local Research Ethics Committees.

CNV detection and annotation in community-based cohorts

   For the three community-based cohorts, we called CNVs using
   PennCNV^[157]48 and QuantiSNP^[158]49 with the same parameters as
   previously published: number of consecutive probes for CNV detection
   ≥ 3, CNV size ≥ 1 Kb, and confidence scores ≥ 15^[159]25. To minimize
   false discoveries, we merged CNVs detected by PennCNV and QuantiSNP
   with CNVision. We then applied a CNV inheritance analysis algorithm to
   concatenate adjacent CNVs of the same type, separated with a gap lower
   than 150 kb. Subsequently, we selected CNVs for analyses based on the
   following criteria: confidence score ≥ 30 with at least one detection
   algorithm, size ≥ 50 kb, and < 50% overlap with segmental duplication.
   A detailed description of CNV calling is available in
   the [160]Supplementary Methods. Note that the entire genome was
   searched without any pre-selection criteria beyond the three parameters
   reported above. No CNV regions were pre-selected based on gene
   characteristics, such as known neurodevelopmental effects or clinical
   significance.

   Detected CNVs were then annotated using Gencode V19 (hg19) with
   [161]ENSEMBL. We only considered genes with all transcripts (UTRs,
   start and stop codons, exons, and introns) fully encompassed within a
   CNV. Our dataset included a total of 5822 genes that were deleted or
   duplicated at least once. These genes were then annotated by the
   inverse loss-of-function observed/expected upper-bound fraction
   (1/LOEUF) using gnomAD, version 2.1.1.^[162]28. The 1/LOEUF score is
   available for 5525 genes and ranges from 0.5 (tolerant to
   loss-of-function variants) to 27.0 (intolerant to loss-of-function
   variants). We calculated a CNV burden score by summing up 1/LOEUF
   scores of all genes encompassed in the deleted or duplicated CNVs for
   each individual. Individuals without CNVs or with no genes encompassed
   in any CNVs were assigned a score of 0. Sensitivity analysis with the
   number of genes deleted/duplicated as a CNV burden score was done
   (Supplementary Fig. [163]4).

Regional surface area from magnetic resonance imaging data

   All T1-weighted (T1w) MRI data (1.0-1.1 mm isotropic resolution) were
   processed with FreeSurfer. A detailed description of MRI acquisition
   and processing is available in [164]Supplementary Methods and
   Supplementary Data [165]12. Briefly, T1w scans were processed through
   an automated cortical reconstruction pipeline in FreeSurfer
   (‘recon-all’). Subsequently, the cortical reconstruction outputs were
   used to extract SA based on a parcellation scheme using FreeSurfer. We
   then derived SA for the 11 cortical regions that intersect with
   prenatal tissue sampling from the PsychENCODE Consortium. Further
   details are available in the [166]Supplementary Methods.

Estimating effects of CNVs on regional surface area

   To estimate the effects of genome-wide CNV burden on SA in the general
   population, we pooled the IMAGEN, SYS, and UKBB data together and
   performed linear mixed-effect models with Restricted Maximum
   Likelihood. In the model, the CNV burden score was the independent
   variable of interest, and the SA (Z-score) of each of the 11 cortical
   regions was the dependent variable. Sex, age, scan site, and first 10
   genetic principal components were included as fixed effects, and the
   family ID was the random effect to account for potential confounders.
   The effect size (beta) estimation with the model was done for deletion
   and duplication separately. Thus, we obtained one effect size (beta)
   profile (across 11 regions) for genome-wide CNV deletion and
   duplication, respectively. Subsequently, we tested the significance of
   the effect size (beta) profiles across the 11 regions using a
   permutation approach. Specifically, the permutation test included the
   following steps: (1) Permuting the SA of the 11 regions across
   individuals while preserving the inter-regional associations; (2)
   Calculating betas between the resampled SA and the CNV burden score to
   obtain a simulated effect-size profile using corresponding models; (3)
   Repeating steps 1 and 2 10,000 times and then averaging the simulated
   profiles to derive a null profile; (4) Using paired t test to test if
   the observed effect-size profile is different from the null profile.

Virtual ontogeny

   To investigate developmental mechanisms underlying the CNV-related
   variations on cortical expansion at the cellular level, we tested the
   spatial relationships between the profile of CNV-SA effect sizes and
   expression of cell-specific genes in the fetal cortex, as introduced in
   our previous work^[167]3. We first selected 200 genes with highest
   specificity for each of 8 cell types in the human cerebral cortex
   during the first and second trimesters based on single-cell
   RNA-sequencing data (5 donors, 6–22 gestational weeks)^[168]31. The
   specificity of a gene for a cell type was defined by the proportion of
   the gene’s expression in that cell type relative to the total
   expression across all cell types^[169]50. Sensitivity analyses with the
   top 100 and 300 cell-specific genes, genes specific to 16
   mid-gestational cortical cell types from another single-cell
   transcriptomic dataset^[170]33 were also conducted (Supplementary
   Data [171]4 and [172]Supplementary Methods). Prenatal expression levels
   of these cell-specific genes across 11 regions were obtained from the
   PsychENCODE bulk RNA sequencing dataset (14 donors, 10–22
   post-conception weeks; [173]http://development.psychencode.org/). We
   filtered out genes with low expression (defined as less than 0.5 RPKM
   mean expression) and adjusted for donor-specific covariates such as
   sex, hemisphere, RNA integrity number, ethnicity, and sequencing site.
   Then the median gene expression per cortical region across all 14
   donors was used for the subsequent analysis. A detailed description of
   the processing of single-cell and bulk RNA sequencing data is available
   in the [174]Supplementary Methods and previous
   publications^[175]3,[176]32,[177]50.

   We then tested the spatial correlation between the inter-regional
   expression profiles of genes specific to a cell type and the profile of
   CNV-SA effects using a permutation test. Specifically, for a given cell
   type, we calculated the bicor between the cell-specific genes and
   CNV-SA effects across the 11 regions. Bicor is median-based rather than
   mean-based and, thus, is less sensitive to outliers than other
   correlation methods. The mean of the resulting 200 correlation
   coefficients was then tested for significance using a permutation
   approach. The null hypothesis postulates that the cell-specific genes
   have the same correlations with the CNV-SA effect profile as any random
   set of genes. Thus, in the permutation test, we randomly selected 200
   genes not specific to the cell type and calculated the mean of
   correlations between these genes’ expression and the CNV-SA effect
   profile. This was repeated 10,000 times to generate a null
   distribution. The observed mean correlation coefficient for the cell
   type was compared against the null distribution using a two-sided test
   with FDR for the 8 cell types ([178]Supplementary Methods).

   In addition, we also performed two different sensitivity analyses: (1)
   bootstrapped estimation for the distribution of the mean correlation
   and (2) gene-set enrichment analysis. The bootstrap analysis aimed to
   obtain 95% confidence intervals for the mean correlation between the
   profile CNV-SA effects and the expression of cell-specific genes and
   tested if the distribution differed significantly from zero. For each
   cell type, we bootstrapped the 200 cell-specific genes as well as the
   14 donors to account for the uncertainty related to the limited sample
   of donors. The gene-set enrichment analysis was done with R package
   “clusterprofile”^[179]51 to test if the cell-specific genes have
   significant correlations with the CNV-SA effect profile than the other
   genes available in the bulk RNA dataset. Further details of these
   sensitivity analyses can be found in the [180]Supplementary Methods.

Proximal and distal genes in network of progenitor genes

   We used prenatal expression data from PsychENCODE to estimate spatial
   coexpression (across the 11 cortical regions) between each of the
   deleted/duplicated genes within the genome-wide CNVs and genes specific
   to radial glia and IPC in the fetal cortical cortex. There were 3646
   out of 5822 CNV genes with prenatal expression data. For each gene, we
   calculated the mean bicor between the expression profile of the gene
   and the expression profiles of 200 genes specific for radial glia and
   IPC, respectively. The coexpression with the neural progenitor cells
   was defined as the absolute value of the average coexpression between
   the gene and genes specific to radial glia and IPC. Thus, in this
   unsigned weighted network, a strong connection/link means that a gene
   has strong coexpression, either negative or positive, with neural
   progenitor cell genes in the midgestational fetal cortex. Note that
   genes showing strong coexpression, either negative or positive, had a
   larger effect size on SA compared with genes with weak coexpression
   (Supplementary Fig. [181]12).

   We divided the 3646 genes into 2 groups based on the median of their
   coexpression values (= 0.193). We termed the genes with higher and
   lower coexpression as “proximal” and “distal” genes, respectively. Then
   for each participant from the community cohorts, we obtained the counts
   of deleted or duplicated genes belonging to the proximal, distal, and
   other genes (genes without expression data). There were 843 and 1239
   participants with at least one proximal and one distal gene deleted,
   respectively. A total of 5192 and 5112 participants had at least one
   proximal and one distal gene duplicated, respectively. We used linear
   models to estimate the effect size of proximal and distal genes on the
   SA of the 11 regions and the total SA of the whole cerebral cortex
   (Z-scored). Since many CNVs encompass more than one gene, we included
   the counts of proximal and distal genes as the independent variable of
   interest and the count of other genes as one of the confounders (sex,
   age, scan site, first 10 genetic principal components). This was done
   for deletion and duplication separately. The significance of the effect
   size (beta) profiles across the 11 regions for proximal and distal
   genes and differences between the two profiles was tested using t
   tests.

   In addition to the above dichotomous approach, we also used a sliding
   window approach to obtain a nuanced picture of the relationship between
   genes’ proximal/distal position to neural progenitors and their effects
   on cortical expansion. We first divided the 3646 genes deleted and/or
   duplicated in our dataset into 20 subgroups based on percentiles of
   their coexpression with neural progenitor genes. We then included the
   first four subgroups as the first “window” and moved the window forward
   by a step size of one subgroup. This resulted in 17 windows in total,
   with Window 1 having the lowest coexpression (mean = 0.034) and Window
   17 having the highest coexpression (mean = 0.428) with the neural
   progenitor genes. On average, there were 241 and 611 unique genes
   within a window deleted and duplicated at least once, respectively
   (Supplementary Fig. [182]9). We used this division approach to ensure
   that each window contained an equal number of genes, thus, making the
   number of participants with genes within a window deleted/duplicated
   comparable across windows, and having enough power to estimate effects
   of genes on SA. For a given window, we obtained the counts of deleted
   or duplicated genes belonging to the window, and outside the window for
   each participant from the community cohorts. We then estimated the
   effect of genes within the window on SA (11 regions and whole cerebral
   cortex), while controlling the effects of genes outside the window. On
   average, there were 479 and 1314 participants with at least one gene
   within a window deleted and duplicated, respectively (Supplementary
   Fig. [183]9).

Protein-protein interaction and pathway enrichment analysis

   Physical protein-protein interactions for the deleted proximal and
   distal genes were obtained from the Integrated Interactions Database
   (IID) version 2021-05 ([184]https://iid.ophid.utoronto.ca/)^[185]52.
   The submitted gene lists (i.e., 579 deleted proximal and 633 deleted
   distal genes) were mapped to proteins, and their protein-protein
   interactions were derived using the default settings of IID. Individual
   protein interaction networks were then loaded into NAViGaTOR 48 ver.
   3.0.17 for further annotation, analysis, and visualization^[186]53.
   There were 448 proximal and 479 distal genes that had at least one
   physical protein-protein interaction with other proximal and distal
   genes, respectively. Following, pathway enrichment analysis on these
   proximal/distal genes was conducted using pathDip, version
   5.0.32.3^[187]54.

Expression quantitative trait loci of genes and surface area

   To investigate which of the proximal genes level of expression in the
   fetal cortex was associated with the SA, we used an eQTL score
   constructed from eQTLs of each proximal gene as the proxy for the
   gene’s expression level in the fetal cerebral cortex. The largest eQTL
   study available for this purpose (201 mid-gestational human cortices)
   provides the summary statistics of eQTLs for 15,925 tested
   genes^[188]55. In this dataset, the 543 (out of 579) deleted proximal
   genes have at least one eQTL (nominal p-value < 0.01). Then, for each
   proximal gene, we calculated the eQTL score by summing up genotypes of
   each CNV non-carrier from UKBB, weighted by the corresponding genotype
   effect size (beta) derived from the eQTL study (p-value < 0.01). The
   calculation of eQTL scores was done using PRSice-2 with default
   clumping parameters (kb = 250, r^2 = 0.1). A higher eQTL score of a
   gene indicates a genetic predisposition for higher expression levels of
   the gene in the fetal cerebral cortex. We then used linear mixed models
   to estimate the effect of these eQTL scores on the SA of each of the 11
   cortical regions in the UKBB participants, with sex and the first 10
   genetic principal components as covariates and participants as a random
   effect. Note that the SA and eQTL scores were normalized (Z-score)
   across individuals before performing the analyses. There were 54 genes
   that showed nominal significant correlations between their eQTL scores
   and SA. Following, we used the same mixed effect modal to obtain the
   effect size (standardized beta) of correlations between the eQTL scores
   of these 54 genes and SA of unimodal and multimodal regions,
   respectively. Then paired t-test was used to compare the absolute value
   of effect sizes of eQTL-SA correlation between unimodal and multimodal
   regions.

   To compare the expression of neural progenitor genes in the multimodal
   and the unimodal regions during midgestation, expression levels of
   radial glia and IPC genes for all midgestation donors (n = 14) in
   PsychENCODE were unit-scaled following adjustment for sex, hemisphere,
   RIN, ethnicity, and sequencing site. Differences in expression between
   the clusters were evaluated using a linear mixed effects model
   adjusting for the age of donors (in log2(days)), with random effects
   for donor id and gene id.

Mechanism of specific CNVs and surface area

   To estimate the effect size on SA of 1q21.1, 16p11.2, and 22q11.2 CNV
   deletions and duplications, we conducted a linear regression model for
   each CNV deletion and duplication, separately. In the model, the CNV
   status (1 for carrier, 0 for non-carrier) was the independent variable,
   and the normalized (Z-score) SA of each of the 11 cortical regions was
   the dependent variable. Age, sex, and data sources were included as
   covariates. Subsequently, we tested the significance of the effect size
   (beta) profiles across the 11 regions for each CNV using the same
   permutation approach as for the genome-wide CNVs with FDR across the
   six genetic variants (deletion/duplication x 3 CNVs). Virtual ontogeny
   analyses were conducted with the derived effect size profiles to
   neurodevelopmental mechanisms underlying these CNV-related variations
   in SA.

   To identify potential key genes contributing to the CNV-related
   variations in SA, we first used prenatal expression data from
   PsychENCODE to identify CNV genes that were spatially co-expressed with
   the radial glia and IPC genes in the fetal cortex. Note that there are
   88 genes within the three clinical CNVs: 8 in 1q21.1, 34 in 16p11.2,
   and 46 in 22q11.2. Out of these 88 genes, there were 65 genes (7 for
   1q21.1, 24 for 16p11.2, and 34 for 22q11.2) with prenatal expression
   data in the 11 regions. For each of the 65 CNV genes, we calculated the
   bicor between the expression profile of the CNV gene and the expression
   profiles of 200 genes specific for radial glia and IPC, respectively.
   Then the mean of the resulting 200 bicor was tested for significance
   using the similar permutation approach as described above (“Virtual
   Ontogeny”). Thus, the mean of correlations between the expression of
   200 randomly selected genes and the CNV gene’s expression profile was
   calculated. This was repeated 5000 times to generate a null
   distribution. The observed mean correlation coefficient for the cell
   type (radial glia or IPC) was compared against the null distribution
   using a two-sided test with FDR for the 65 CNV genes. There were 50
   genes spatially coexpressed with both radial-glia and IPC genes across
   the 11 regions during the midgestation. Following, we tested the
   correlations between the eQTL scores of 47 out of these 50 genes (3
   genes had no eQTL scores) and the SA of 11 regions with FDR correction.

Reporting summary

   Further information on research design is available in the [189]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [190]Supplementary Information^ (22.3MB, docx)
   [191]41467_2025_56855_MOESM2_ESM.pdf^ (9.1KB, pdf)

   Description of Additional Supplementary Files
   [192]Supplementary Datasets 1-12^ (604.4KB, xlsx)
   [193]Reporting Summary^ (3MB, pdf)
   [194]Transparent Peer Review file^ (2.1MB, pdf)

Source data

   [195]Source Data^ (976.7KB, xlsx)

Acknowledgements