Abstract

   Nonrecombining sex chromosomes, like the mammalian Y, often lose genes
   and accumulate transposable elements, a process termed degeneration.
   The correlation between suppressed recombination and degeneration is
   clear in animal XY systems, but the absence of recombination is
   confounded with other asymmetries between the X and Y. In contrast, UV
   sex chromosomes, like those found in bryophytes, experience symmetrical
   population genetic conditions. Here, we generate nearly gapless female
   and male chromosome-scale reference genomes of the moss Ceratodon
   purpureus to test for degeneration in the bryophyte UV sex chromosomes.
   We show that the moss sex chromosomes evolved over 300 million years
   ago and expanded via two chromosomal fusions. Although the sex
   chromosomes exhibit weaker purifying selection than autosomes, we find
   that suppressed recombination alone is insufficient to drive
   degeneration. Instead, the U and V sex chromosomes harbor thousands of
   broadly expressed genes, including numerous key regulators of sexual
   development across land plants.

INTRODUCTION

   Sex chromosomes arise when an ordinary pair of autosomes gains the
   capacity to determine sex ([80]1). A defining characteristic of sex
   chromosomes is suppressed recombination in the heterogametic sex. It is
   widely believed that this lack of meiotic recombination makes natural
   selection less effective, predisposing nonrecombining chromosomes, like
   the mammalian Y, to degeneration and gene loss ([81]2, [82]3). However,
   although some nonrecombining chromosomes rapidly degenerate, or are
   completely lost, the sex chromosomes in other groups remain homomorphic
   or expand ([83]2). This diversity of form and gene content suggests
   that the role of suppressed recombination in the long-term trajectory
   of sex chromosome evolution must be modulated by other processes
   related to the life history of the organism. Identifying these
   important processes requires comparative analyses across multiple
   eukaryotic lineages.

   Many organisms, including bryophytes, algae, and some fungi, have a
   haploid UV sex chromosome system, in which females inherit a
   nonrecombining U and males inherit a nonrecombining V ([84]4, [85]5).
   The sex-specific transmission pattern of both chromosomes means that
   factors that are confounded in XY or ZW systems, such as suppressed
   recombination, hemizygosity, and sex-limited inheritance, are
   independent on UV chromosomes ([86]4–[87]6). Many UV sex chromosome
   systems may be ancient ([88]5), providing ample time for degenerative
   processes to act. However, the structural complexity of sex chromosomes
   has precluded genomic analyses in UV systems. Here, we evaluate the
   relative roles of gene gain and degeneration in shaping the evolution
   of the bryophyte UV sex chromosomes using nearly-gapless,
   chromosome-scale female and male genomes of the moss Ceratodon
   purpureus.

RESULTS

Assembly of C. purpureus female and male genomes

   Ancestral-state reconstructions of dioecy suggest that sex chromosomes
   evolved early in the history of the extant mosses ([89]7). To
   reconstruct the evolutionary history of the bryophyte UV sex
   chromosomes, we assembled and annotated chromosome-scale genomes of GG1
   (female) and R40 (male) C. purpureus isolates. Although the C.
   purpureus genome is relatively small, the sex chromosomes are large and
   have extensive repeat content, making them a challenge to assemble
   ([90]8), particularly with short-read technologies, which often do not
   span a whole repeat. We therefore used a combination of Illumina,
   bacterial artificial chromosomes (BACs), PacBio, and Dovetail Hi-C
   (figs. S1 and S2 and tables S1 to S4). The version 1.0 genome assembly
   of R40 comprises 358 Mb in 601 contigs (N50 1.4 Mb), with 98.3% of the
   assembled sequence in the largest 13 pseudomolecules, corresponding to
   the 13 chromosomes in its karyotype ([91]9). The version 1.0 GG1
   assembly is 349.5 Mb in 558 contigs (N50 1.4 Mb), with 97.9% of
   assembled sequence in the largest 13 pseudomolecules. Using more than
   1.5 billion RNA sequencing (RNA-seq) reads for each of the genome lines
   (GG1 and R40) and additional de novo assemblies of other C. purpureus
   isolates (table S5), we annotated 31,482 genes on the R40 assembly and
   30,425 on GG1 [BUSCO v3.0 of 69% using Embryophyte; 96.7 and 96.4%,
   respectively using Eukaryote; values similar to the moss Physcomitrium
   patens ([92]10)].

Identifying the moss ancestral chromosome elements

   To examine the conservation of genome architecture, we performed
   synteny analyses between the two C. purpureus genomes and the P. patens
   genome. GG1 and R40 were collected from distant localities (Gross
   Gerungs, Austria and Rensselaer, New York, USA, respectively) ([93]11),
   and we found that the assemblies had numerous structural differences
   ([94]Fig. 1). In the self-synteny analysis, we found clear homeologous
   chromosome pairs resulting from an ancient whole-genome duplication
   (WGD) ([95]Fig. 1 and fig. S2), consistent with previous transcriptomic
   ([96]11, [97]12) and our own Ks-based analyses (fig. S2 and table S6).
   We also identified abundant synteny between the C. purpureus and P.
   patens chromosomes, which diverged over 200 million years (Ma) ago
   ([98]Fig. 1) ([99]13). This result demonstrates that the ancestral
   karyotype of most extant mosses consisted of seven chromosomes
   ([100]13), which we refer to as ancestral elements A to G ([101]Fig.
   1), and suggests that major parts of the gene content of moss
   chromosomes are stable over hundreds of millions of years, similar to
   the “Muller Elements” in Drosophila ([102]14). Curiously, we could not
   detect the homeologs of the C. purpureus chromosomes 5 and 9 using
   synteny, an observation we return to below.

Fig. 1. Chromosome architecture in C. purpureus.

   [103]Fig. 1
   [104]Open in a new tab

   (A) Dot plot of syntenic orthogroup blastp hits between C. purpureus
   GG1 and R40 isolates, showing structural variation on autosomes and a
   lack of synteny across the sex chromosomes. (B) Self-synteny plot of C.
   purpureus R40 isolate showing homeologous chromosomes from a WGD. (C)
   Dot plot of syntenic orthogroup blastp hits between C. purpureus R40
   and P. patens, highlighting the seven ancestral chromosomes that we
   refer to as the moss ancestral elements A to G. (D) Density plots
   across C. purpureus chromosomes (in megabases). Densities show the
   proportion of a 100-kb window (90-kb jump) of each feature. Local
   density peaks of RLC5 Copia elements (purple Copia peaks) on each
   chromosome represent candidate centromeric regions, similar to P.
   patens ([105]13).

Suppressed recombination causes weak degeneration but not gene loss

   The major exception to the long-term genomic stability observed in C.
   purpureus was the sex chromosomes, which also share no obviously
   syntenic regions with each other or the autosomes ([106]Fig. 1). The
   sex chromosomes are ~30% of each genome (110.5 Mb on the R40 V, 112.2
   Mb on the GG1 U; [107]Fig. 1), four times the size of the largest
   autosome. The size is largely attributable to an increase in
   transposable elements (TEs), which comprise 78.2 and 81.9% of the U and
   V, respectively, similar to the nonrecombining Y or W sex chromosomes
   in other systems ([108]15), but far more than the C. purpureus
   autosomes [mean (μ), 46.4%; Mann-Whitney U with Benjamini and Hochberg
   correction (MWU), autosomes to U or V P < 2 × 10^−16; [109]Fig. 1].
   While some TEs have a homogeneous distribution across all chromosomes
   (e.g., Copia; μ: autosomes = 0.8%, U = 1.3%, V = 1.2%; MWU, all
   pairwise comparisons P > 0.09), the U and V chromosomes are enriched
   for very different classes of repeats compared to each other and the
   autosomes. For example, the U was enriched for hAT (μ: autosomes =
   2.3%, U = 10.1%, V = 7.4%; MWU, all pairwise comparisons P < 1.5 ×
   10^−14) and the V was enriched in a previously undescribed superfamily
   of cut-and-paste DNA transposons, which we refer to as Lanisha elements
   (μ: autosomes = 1%, U = 1.2%, V = 5.8%; MWU, all pairwise comparisons P
   < 1 × 10^−4; [110]Fig. 1 and fig. S3). The distribution of repeats in
   C. purpureus and the physical proximity of the autosomes inferred from
   the Hi-C contact map (fig. S2) together highlight the enigmatic
   isolation of the sex chromosomes in the nucleus ([111]16).

   Unlike other nonrecombining sex chromosomes, neither the U nor the V
   shows signs of major degeneration beyond the increased TE density.
   Sex-linked genes used on average one more codon than autosomes
   [effective number of codons (ENC)], less frequently use optimal codons
   [frequency of optimal codon (fop)], have a loss of preferred GC bias in
   the third synonymous codon position (GC3s), and have a higher rate of
   protein evolution (dN/dS), all consistent with weaker selection (MWU,
   autosomes to U or V P < 6 × 10^−6 for all metrics; [112]Fig. 2).
   Although, notably, the U- and V-linked genes were not different from
   one another (MWU, ENC P = 0.8; fop P = 0.22; GC3s P = 0.18; dN/dS P =
   0.73), suggesting that transmission through one sex or the other has no
   detectable effect on purifying selection. Consistent with this
   observation, the U and the V have 3450 and 3411 transcripts,
   respectively, representing ~12% of the C. purpureus gene content. This
   stands in stark contrast to the nonrecombining mammalian Y chromosome,
   or even other UV systems, which typically contain an order of magnitude
   fewer genes, at most ([113]17–[114]19). These observations indicate
   that although suppressed recombination decreases the efficacy of
   natural selection, alone, it is insufficient to drive gene loss on
   nonrecombining sex chromosomes ([115]20).

Fig. 2. Molecular evolution of autosomal and sex-linked genes in C.
purpureus.

   [116]Fig. 2
   [117]Open in a new tab

   (A) Autosomal genes are significantly different from U- or V-linked
   genes in the ENC. (B) fop. (C) GC content of the third, synonymous
   codon (GC3s) and (D) protein evolution (dN/dS) (MWU, autosomes to U or
   V P < 6 × 10^−6 for all metrics, indicated by ***; numbers show means).
   However, U- and V-linked genes were not significantly different [MWU,
   ENC P = 0.8; fop P = 0.22; GC3s P = 0.18; dN/dS P = 0.73, indicated by
   NS (not significant)], suggesting weak but not significantly different
   degeneration on the U and V. For dN/dS, four U-linked genes and two
   V-linked genes fell above the given scale of the y axis.

Moss sex chromosomes are ancient but evolutionarily dynamic

   The lack of degeneration means that thousands of genes can be used to
   reconstruct a detailed history of gene gain on the C. purpureus UV sex
   chromosomes. Critically, the times to the most recent common ancestor
   between orthologous genes on the U and V chromosomes allow us to
   estimate a minimum age for the sex chromosome system. In principle, the
   evolution of sex linkage should mimic the effects of a gene
   duplication, with identical U- and V-linked clades that coalesce at the
   node where recombination between them ceased (fig. S4). To identify
   these nodes, we used a phylogenomic approach with stringent inclusion
   criteria. We built 744 gene trees, 402 with U- and V-linked homologs.
   We found that most genes became sex-linked in the C. purpureus lineage,
   after the divergence from Syntrichia princeps (μ Ks = 0.16; [118]Fig. 3
   and table S7). However, 13 U-V orthologous pairs diverged at the base
   of the Dicranidae (μ Ks = 0.85), and three pairs diverged before the
   split between the two diverse clades Bryidae and Dicranidae (μ Ks =
   1.64). The most ancient U-V divergence (a Zinc finger Ran binding
   protein of unknown function) was before the split between Buxbaumia
   aphylla and the remaining Bryopsida, ~300 Ma ago [based on previous
   fossil-calibrated, relaxed-clock analyses ([119]21)] (Ks = 2.8; fig.
   S4).

Fig. 3. Evolutionary history of moss and liverwort sex chromosomes.

   [120]Fig. 3
   [121]Open in a new tab

   (A) Capture events of genes on moss and liverwort sex chromosomes.
   Numbers indicate how many extant genes were captured at the indicated
   branch based on the topology of the tree. The capture events in mosses
   can be traced back to three ancestral elements (A, B, and D), where the
   oldest sex-linked genes were from ancestral element A and homeologous
   chromosomes from B and D fused to the sex chromosomes. (B) Ks of
   one-to-one U-V orthologs plotted on U and V sex chromosomes of C.
   purpureus. Lines connect the U-V orthologs, where colors correspond to
   the ancestral elements in (A). The darker brown lines indicate genes
   with Ks ≤ 0.02, presumably representing the most recently captured
   genes, which highlights the rapid rearrangement of genes on the sex
   chromosomes. These data, in addition to synteny ([122]Fig. 1), also
   suggest a lack of a pseudo-autosomal region between the C. purpureus U
   and V.

   It is possible that the Zinc finger gene duplicated before the split
   between B. aphylla and the remaining Bryopsida, and these duplicates
   each were independently captured by the sex-determining locus in these
   two lineages. However, a more parsimonious explanation is that the
   origin of the bryophyte sex chromosome system predated the divergence
   between B. aphylla and the remaining Bryopsida. This observation
   provides support for the maximum parsimony ancestral state
   reconstruction of sexual system of McDaniel et al. ([123]7) but pushes
   back the origins of dioecy in the common ancestor of the Bryidae and
   Dicranidae to before the origin of the arthrodontous mosses ([124]Fig.
   3). These data also provide an independent means to evaluate the
   inferred transition bias toward dioecy suggested by that analysis.

   A classic signature of gene capture on sex chromosomes is the presence
   of strata, where neighboring genes added in the same recombination
   suppression event have a similar Ks ([125]22). However, on the C.
   purpureus sex chromosomes, we found that Ks was not associated with
   gene order ([126]Fig. 3). Even genes with very low Ks, presumably from
   the most recent recombination suppression event, were found across the
   entirety of the U or V, meaning that gene order was shuffled soon after
   the evolution of sex linkage. To understand the mechanism by which the
   region of suppressed recombination acquires new genes, we combined
   inferences from phylogenomic analyses with the physical position of
   orthologs among the ancestral karyotypic elements. When we examined
   gene trees for the two most recent capture events, we found that the
   overwhelming majority are from ancestral elements D (~80% of C.
   purpureus–specific captures; table S7) and B (~92% of Dicranidae
   captures; table S7) indicating that the missing homeologous chromosomes
   to C. purpureus 5 and 9, respectively, had fused to the sex chromosomes
   ([127]Fig. 3), but the scrambling of gene order had rendered them
   undetectable using synteny alone.

   The mechanism by which the chromosome fusions occur in UV systems is
   not entirely clear. In other systems, the sex chromosomes have a
   pseudo-autosomal region (PAR) that is linked to the nonrecombining
   portion of the sex chromosome. Unlike the sex-limited region, the PAR
   pairs normally at meiosis, presumably to assure 1:1 segregation. The
   addition of new genes to both sex chromosome partners can proceed
   through the expansion of suppressed recombination into the PAR or the
   translocation of other genomic regions to the PAR. Curiously, we could
   find no trace of a PAR in the C. purpureus assemblies, nor in any of
   the unassembled scaffolds ([128]Figs. 1 and [129]3). We suspect that
   the PAR reported in C. purpureus genetic maps ([130]7, [131]8) reflects
   artifactual linkage generated by wide genetic crosses ([132]8,
   [133]23). Nevertheless, abundant genotyping of haploid spores from
   single diploid sporophytes strongly suggests that some PAR-independent
   mechanism enforces 1:1 segregation of the U and V sex chromosomes
   ([134]24). Thus, while small genomic regions could become independently
   incorporated into the U or the V, the translocation of whole
   chromosomes would likely result in homologous pairing between the
   neo-sex chromosome arms and ultimately the incorporation of the region
   into both the U and V.

   To extend the ancestral reconstruction to liverwort sex chromosomes, we
   generated gene trees using transcriptome data combined with previously
   identified sex-linked genes in Marchantia polymorpha ([135]18). Like in
   C. purpureus, we found no evidence of syntenic strata when we compared
   Ks between the U- and V-linked orthologs (table S8). We also found
   evidence of four liverwort-specific capture events, with the oldest
   diverging ~400 Ma ago, before the split of Marchantiidae and Pelliidae
   ([136]Fig. 3 and fig. S4) ([137]21). Our analyses show that most
   sex-linked genes in M. polymorpha (table S8), like two of the oldest
   genes in C. purpureus (table S7), have homologs from moss ancestral
   element A. This new insight leads to the remarkable suggestion that
   this element played a key role in sex determination early in the
   history of both lineages, ~500 Ma ago, making the Setaphyte sex
   chromosomes among the oldest known to date across Eukarya.

The C. purpureus sex chromosomes harbor broadly expressed, conserved
regulators of sexual development

   A key factor explaining the retention of transcripts on nonrecombining
   sex chromosomes is broad gene expression ([138]25, [139]26), which in
   plants includes the haploid phase. In transcriptomic data from multiple
   tissues, we found more than 1700 U- and V-linked genes expressed (mean
   count ≥ 1) (fig. S5 and tables S9 to S11), including essential
   components of the cytoskeleton (e.g., tubulin) and DNA repair complexes
   (e.g., RAD51). We found that the number of sex-biased autosomal genes
   (mean count ≥ 1, fold change ≥ 2, adjusted P ≤ 0.05) was far eclipsed
   by expressed sex-specific genes (i.e., those only on the U or V),
   suggesting that sex-linked loci contribute more to expression
   differences between the sexes than do autosomes (fig. S5). Furthermore,
   in contrast to data from gene-poor sex chromosome systems, we found
   that nearly all the genes in the female- and male-specific coexpression
   modules, including the hubs, were sex-linked (fig. S6 and table S12).

   The sex-specific gene expression networks are enriched for proteins
   with known reproductive functions across green plant lineages. For
   example, the male Gene Ontology (GO) and Kyoto Encyclopedia of Genes
   and Genomes (KEGG) terms show enrichment for microtubule-based
   processes, which play a role in sperm production in other systems (fig.
   S6 and tables S13 and S14) ([140]27, [141]28). We also found that both
   female and male coexpression modules are enriched for genes involved in
   circadian rhythm, like phytochrome, which are involved in flower
   development in Arabidopsis thaliana ([142]29). The male coexpression
   module also contained a V-specific ABC1 gene orthologous to a V-linked
   copy in M. polymorpha (fig. S7), and genes in this family are involved
   with pollen development in angiosperms ([143]30). The female
   coexpression module contains a U-specific RWP-RK transcription factor
   (TF) orthologous to M. polymorpha MpRKD, which is a conserved component
   of the egg development pathway across land plants and are mating-type
   loci in green algae (fig. S7) ([144]17, [145]31, [146]32). Moreover,
   the cis-acting sexual dimorphism switch MpFGMYB ([147]33), which
   promotes female development in M. polymorpha, as well as the P. patens
   CRINKLY4 (PpCR4) gene, which functions in archegonial neck development
   (fig. S7) ([148]34), have orthologous U- and V-linked copies in C.
   purpureus (fig. S7).

   Several other TFs or transcriptional regulators (TRs) are found in the
   sex-specific coexpression modules (e.g., V-linked R2R3-MYB) or are only
   found on the U or V (e.g., HD DDT, Med7, and SOH1; table S15), together
   suggesting that candidate regulators of sex-specific developmental
   processes are enriched on the C. purpureus UV sex chromosomes. In
   addition, 187 U- and/or V-linked genes are homologous to over 250 A.
   thaliana genes with reproductive roles (fig. S5 and table S16). It is,
   of course, difficult to prove shared gene function across these diverse
   taxa, particularly for genes found in large gene families, and
   neo-functionalization can lead to altered functions in specific
   lineages. Nevertheless, complementing mutants of these genes in
   hermaphroditic species, like A. thaliana or P. patens, with sex-linked
   homologs from C. purpureus is likely to provide a powerful means to
   interrogate the evolution and function of sex-limited gene regulatory
   networks.

DISCUSSION

   Our analyses challenge the idea that suppressed recombination and
   sex-limited inheritance are sufficient to drive sex chromosome
   degeneration. Clearly, the lack of meiotic recombination both weakens
   purifying selection, which results in decreased codon bias and
   increased protein evolution, and facilitates massive structural
   variation and highly differentiated TE accumulation between the U and
   V. Like in other plants, haploid gene expression in C. purpureus
   apparently slows sex chromosome degeneration, even over millions of
   years of suppressed recombination ([149]25). However, unlike flowering
   plants, where hermaphroditism is the norm ([150]35), the antiquity of
   dioecy in bryophytes more closely mirrors the sexual systems in animals
   ([151]36, [152]37). Thus, the gene-rich C. purpureus sex chromosomes
   provide a powerful comparative tool for studying the long-term
   evolution of sex-limited gene regulatory networks that govern sexual
   differentiation.

MATERIALS AND METHODS

Isolate collection and tissue culture

   All C. purpureus tissue used in this study was isolated from a single
   spore ([153]24), from field-collected sporophytes (table S5) ([154]11,
   [155]38). In-depth methods for tissue generation for DNA and RNA,
   library preparation, and sequencing can be found in Supplementary
   Materials and Methods.

Genome assemblies

   We sequenced C. purpureus (var. GG1 and var. R40) using a whole-genome
   shotgun sequencing strategy and standard sequencing protocols.
   Sequencing reads were collected using Illumina, PacBio, and Sanger
   platforms. Illumina, PacBio, and Sanger reads were sequenced at the
   Department of Energy Joint Genome Institute in Walnut Creek, California
   and the HudsonAlpha Institute in Huntsville, Alabama. Illumina reads
   were sequenced using the Illumina HiSeq 2000 and X10 platform, and the
   PacBio reads were sequenced using the SEQUEL I platform. Sanger BACs
   were sequenced using an ABI 3730XL capillary sequencer. For both GG1
   and R40, one 400–base pair (bp) insert 2 × 150 Illumina fragment
   library (133.14× for GG1, 146.45× for R40) was sequenced along with one
   2 × 150 Dovetail Hi-C library (252.86× GG1, 442.71× R40) (table S1).
   Before assembly, Illumina fragment reads were screened for phix
   contamination. Reads composed of >95% simple sequence were removed.
   Illumina reads <50 bp after trimming for adapter and quality (q < 20)
   were removed. For the PacBio sequencing, a total of eight chemistry 2.1
   cells (10-hour movie time) were sequenced each for GG1 and R40 on
   Sequel 1 with a raw sequence yield of 39.82 Gb (GG1) and 46.24 Gb (R40)
   with a total coverage of 113.77× (GG1) and 132.11× (R40) (table S2).
   Last, a total of 1032 BAC clones sequenced with Illumina indexed
   libraries were used for patching the final chromosome gaps.

Genome assembly and construction of pseudomolecule chromosomes

   Improved versions 1.0 of the C. purpureus (var. GG1 and var. R40)
   assemblies were generated by separately assembling the 4,195,510 PacBio
   GG1 reads (113.77× sequence coverage) and 5,238,148 PacBio reads R40
   (132.11× sequence coverage) using the MECAT assembler ([156]39) and
   subsequently polished using QUIVER ([157]40). For GG1, this produced
   637 scaffolds (637 contigs), with a contig N50 of 1.2 Mb, 475 scaffolds
   larger than 100 kb, and a total genome size of 347.1 Mb (table S3). For
   R40, this produced 731 scaffolds (731 contigs), with a contig N50 of
   1.1 Mb, 497 scaffolds larger than 100 kb, and a total genome size of
   361.3 Mb (table S3).

   Hi-C scaffolding using the JUICER pipeline ([158]41) was used to
   identify misjoins in the initial MECAT assembly. Misjoins were
   characterized as a discontinuity in the GG1 or R40 linkage group. A
   total of 73 misjoins were identified and resolved in GG1 and 64 in R40.
   The resulting broken contigs were then oriented, ordered, and joined
   together into 13 chromosomes (12 autosomal and 1 sex chromosome
   designated as “U” in the GG1 release and 12 autosomal and 1 sex
   chromosome designated as “V” in the R40 release) using both the map and
   the Hi-C data. A total of 579 joins were made in GG1 and 625 in R40
   during this process. Each chromosome join is padded with 10,000 Ns.
   Significant telomeric sequence was identified using the (TTTAGGG)[n]
   repeat, and care was taken to make sure that it was properly oriented
   in the production assembly. The remaining scaffolds were screened
   against bacterial proteins, organelle sequences, and GenBank
   non-redundant database and removed if found to be a contaminant. For
   GG1, a set of 1032 BAC clones (107.8-Mb total sequence) sequenced with
   Illumina indexed libraries were used to patch remaining gaps in the
   chromosomes. Clones were aligned to the chromosomes using BLAT
   ([159]42), and clone contigs crossing gaps were used to form patches. A
   total of 35 gaps were patched.

   Last, homozygous single-nucleotide polymorphisms (SNPs) and insertions
   or deletions (INDELs) were corrected in the release consensus sequence
   using ~88× of Illumina reads (2 × 150, 400-bp insert) by aligning the
   reads using bwa mem ([160]43) and identifying homozygous SNPs and
   INDELs with the Genome Analysis Toolkit UnifiedGenotyper tool
   ([161]44). A total of 108 homozygous SNPs and 5291 homozygous INDELs in
   GG1 and 19 homozygous SNPs and 867 homozygous INDELs in R40 were
   corrected in the release. The final version 1.0 GG1 release contains
   349.5 Mb of sequence (1.3% gap), consisting of 558 contigs with a
   contig N50 of 1.4 Mb and a total of 97.9% of assembled bases in
   chromosomes. The final version 1.0 R40 release contains 358.0 Mb of
   sequence (1.2% gap), consisting of 601 contigs with a contig N50 of 1.4
   Mb and a total of 98.3% of assembled bases in chromosomes.

   Completeness of the euchromatic portion of the version 1.0 GG1 and 1.0
   R40 assemblies was assessed by aligning an RNA-seq library (library
   code GNGZB for GG1 and GNGZC for R40). The aim of this analysis is to
   obtain a measure of completeness of the assembly, rather than a
   comprehensive examination of gene space. The transcripts were aligned
   to the assembly using GSNAP ([162]45). The alignments indicate that
   96.88% of the GG1 RNA-seq reads aligned to the version 1.0 GG1 release
   and 97.01% of the R40 RNA-seq reads aligned to the version 1.0 R40
   release.

Construction of the scaffold assembly

   A total of 4,195,510 PacBio reads (113.77×) in GG1 and 5,238,148 PacBio
   reads (132.11×) in R40 were assembled using MECAT ([163]39) and formed
   the starting point of the version 1.0 release for each. The 310,662,272
   Illumina sequence reads (133.14× sequence coverage) in GG1 and
   353,932,084 Illumina sequence reads (146.45× sequence coverage) in R40
   were used for fixing homozygous SNP/INDEL errors in the consensus. A
   total of 310,662,272 Hi-C reads (252.86× sequence coverage) in GG1 and
   1,062,837,932 Hi-C reads (442.71× sequence coverage) in R40 were used
   for chromosome construction.

Screening and final assembly release

   Scaffolds that were not anchored in a chromosome were classified into
   bins depending on sequence content. Contamination was identified using
   blastn against the National Center for Biotechnology Information (NCBI)
   nucleotide collection (NR/NT) and blastx using a set of known microbial
   proteins. In GG1, additional scaffolds were classified as repetitive
   (>95% masked with 24-mers that occur more than four times in the
   genome) (16 scaffolds, 482.8 kb), chloroplast (1 scaffold, 158.7 kb),
   and low quality (>50% unpolished bases after polishing, 3 scaffolds,
   48.3 kb). In R40, additional scaffolds were classified as repetitive
   (>95% masked with 24-mers that occur more than four times in the
   genome) (12 scaffolds, 489.6 kb), chloroplast (1 scaffold, 50.2 kb),
   and low quality (>50% unpolished bases after polishing, 6 scaffolds,
   236.8 kb). Resulting final statistics are shown in table S4.

GG1 assessment of assembly accuracy

   A set of 17 finished contiguous Sanger BAC clones >100 kb were selected
   to assess the accuracy of the assembly. A range of variants were
   detected in the comparison of the BAC clones and the assembly. In 14 of
   the BAC clones, the alignments were of high quality (<0.05% base pair
   error) with an example being given in fig. S1. All dot plots were
   generated using Gepard ([164]46). The remaining three BACs indicate a
   higher error rate due mainly to their placement in more regions
   containing tandem repeats (fig. S1). The overall base pair error rate
   in the BAC clones is 0.016% (269 discrepant bp of 1,599,605 bp).

Genome annotations

   Transcript assemblies were made from ~1.5 billion pairs of 2 × 150
   stranded paired-end Illumina RNA-seq reads from C. purpureus GG1 and
   ~1.6 billion pairs from C. purpureus R40 using PERTRAN, which conducts
   genome-guided transcriptome short-read assembly via GSNAP ([165]47) and
   builds splice alignment graphs after alignment validation, realignment,
   and correction ([166]48), PERTRAN assemblies from G100m_X_G150f_Sporo
   reads on the C. purpureus GG1 or R40 genome, and filtered open reading
   frames (ORFs) from Trinity assemblies from stranded paired-end Illumina
   reads from additional C. purpureus cultivars (is Navarino, Magellanes,
   Chile; Durham, NC, USA; Otavalo, Ecuador; Renesselaer, NY, USA; and
   Storrs, CT, USA; table S5). 180,954 (GG1) and 194,414 (R40) transcript
   assemblies were constructed using the Program to Assemble Spliced
   Alignments (PASA) ([167]49) from RNA-seq transcript assemblies above
   and a bit of C. purpureus expressed sequence tags (ESTs). Loci were
   determined by transcript assembly alignments and/or EXONERATE
   alignments of proteins from A. thaliana ([168]50), soybean ([169]51),
   Setaria viridis ([170]52), grape ([171]53), Sphagnum magellanicum, and
   P. patens ([172]13), Selaginella moellendorffii ([173]54), and
   Chlamydomonas reinhardtii ([174]55), filtered Trinity assembly ORFs
   described above, high-confidence gene models from the first round of C.
   purpureus R40 gene call, UniProt Bryopsida, and Swiss-Prot proteomes to
   repeat-soft-masked C. purpureus GG1 genome using RepeatMasker ([175]56)
   with up to 2000-bp extension on both ends unless extending into another
   locus on the same strand. Repeat library consists of de novo repeats by
   RepeatModeler ([176]57) on C. purpureus GG1 genome and repeats in
   RepBase. Gene models were predicted by homology-based predictors,
   FGENESH+ ([177]58), FGENESH_EST (similar to FGENESH+, EST as splice
   site and intron input instead of protein/translated ORF), and EXONERATE
   ([178]59) and PASA assembly ORFs (in-house homology constrained ORF
   finder) and from AUGUSTUS via BRAKER1 ([179]60). The best-scored
   predictions for each locus are selected using multiple positive factors
   including EST and protein support, and one negative factor: overlap
   with repeats. The selected gene predictions were improved by PASA.
   Improvement includes adding untranslated regions, splicing correction,
   and adding alternative transcripts. PASA-improved gene model proteins
   were subject to protein homology analysis to abovementioned proteomes
   to obtain Cscore and protein coverage. Cscore is a protein BLASTP score
   ratio to MBH (mutual best hit) BLASTP score, and protein coverage is
   the highest percentage of protein aligned to the best of homologs.
   PASA-improved transcripts were selected on the basis of Cscore, protein
   coverage, EST coverage, and its coding sequence (CDS) overlapping with
   repeats. The transcripts were selected if its Cscore is larger than or
   equal to 0.5 and protein coverage is larger than or equal to 0.5, or it
   has EST coverage, but its CDS overlapping with repeats is less than
   20%. For gene models whose CDS overlaps with repeats for more than 20%,
   its Cscore must be at least 0.9 and homology coverage at least 70% to
   be selected. The selected gene models were subject to Pfam analysis,
   and gene models whose protein is more than 30% in Pfam TE domains were
   removed and considered weak gene models. Incomplete gene models, low
   homology supported without fully transcriptome-supported gene models of
   short single exons (<300-bp CDS) with neither protein domain nor good
   expression gene models were manually filtered out.

Synteny analysis within C. purpureus and between P. patens

   We ran the default GENESPACE pipeline ([180]48) with a minimum block
   size of 5 genes and a maximum gap/search radius of 15 genes. In short,
   GENESPACE runs Orthofinder ([181]61, [182]62) on synteny-constrained
   blastp hits. This offers higher stringency when exploring highly
   diverged genomes (or ancient WGDs) by removing high-scoring, but
   randomly distributed, blast hits.

Ks plot analysis to identify the C. purpureus WGD

   WGDs were detected with conventional Ks plot analyses. We used the wgd
   pipeline ([183]63). An all-by-all BLASTP search ([184]64) was performed
   for the C. purpureus GG1 and R40 genomes as well as P. patens and M.
   polymorpha. Paralogs were clustered with MCL ([185]65). For each
   cluster, all pairwise Ks estimates were obtained from PAM ([186]66)
   with the GY94 model with F3x4 equilibrium codon frequencies ([187]67).
   Hierarchical clustering was used to reduce redundant comparisons and
   obtain node-averaged Ks estimates. This process was repeated for
   syntenic paralogs too, which were obtained from I-ADHoRe v3.0 with
   default settings ([188]68) based on all-by-all BLASTP results.
   Orthologous gene divergences used reciprocal best BLASTP hits between
   C. purpureus and P. patens.

   Peaks in Ks plots can be identified visually, but we also applied
   mixture models that were selected by the difference in BIC scores, such
   that a difference less than 3.2 is used as a stopping criterion.
   Mixture models were implemented with the bic.test.wgd function
   available on GitHub ([189]https://github.com/gtiley/Ks_plots). Mixture
   models can be problematic in their interpretation because of
   overfitting; therefore, we looked for peaks that were consistently
   detected across models and the maximum Ks value allowed ([190]69). When
   analyzing all paralogs, a single prominent peak was observable in C.
   purpureus with a mean between a Ks of 0.65 and 0.97 in GG1 and a Ks
   between 0.68 and 0.74 in R40 (table S6). The more consistent results in
   R40 imply that more paralogs from this WGD event have survived on the V
   chromosome compared to the U chromosome. This WGD postdates the
   divergence of C. purpureus and P. patens (fig. S2). This is determined
   by visual inspection but agrees with previous analyses of WGD in both
   C. purpureus and P. patens ([191]11–[192]13). The presence of a single
   WGD that occurred in C. purpureus following divergence from P. patens
   is supported by analyses of syntenic paralogs as well (fig. S2), which
   suggests slightly more recent WGD ages (table S6). However, analyses of
   syntenic paralogs from P. patens supported the presence of two WGDs
   following divergence from C. purpureus (table S6), similar to previous
   findings when using syntenic data ([193]13) compared to all paralogs
   from genomic or transcriptomic data ([194]12, [195]70).

   Ks plot analyses are provocative of older WGD events that predate the
   divergence of C. purpureus and P. patens. Notably, low numbers of
   syntenic paralogs are evident between Ks of 3.0 and 4.0; although, the
   same is true for M. polymorpha that putatively has no history of
   ancient WGD. Any identifiable peaks in Ks plot analyses are too
   speculative given the lack of evidence from mixture models and nor does
   their existence affect our proposed model of karyotype evolution. It
   should be noted though that analyses of gene trees that reconcile
   duplication and loss events onto a species tree have implied a shared
   large-scale duplication event shared by C. purpureus and P. patens
   [“B3” ([196]12)] and an even older event shared by all mosses [“B2”
   ([197]12)]. Testing these ancient hypotheses is beyond the scope of Ks
   plot analyses, even with syntenic data. Rather, macrosyntenic evidence
   from more moss species, such as Sphagnum fallax, will be needed to
   identify the presence of expected syntenic ratios among genes, similar
   to the identifiable 1:4 ratios between C. purpureus and P. patens
   investigated here.

TE annotation

   We combined the R40 assembly (autosomes and V) with the U sex
   chromosome assembled from GG1 to run de novo repeat detection using the
   TEdenovo pipeline from the REPET package (v2.4) ([198]71). Parameters
   were set to consider repeats with at least five copies. We obtained a
   library of 4699 consensus sequences that was filtered to keep only
   those that are found at least once as a full-length copy in the
   combined assembly, and we retained 2523 of them. This library of
   consensus sequences was then used as digital probe for whole-genome
   annotation by the TEannot ([199]72) pipeline from the REPET package
   v2.4. Threshold annotation scores were determined for each consensus as
   the 99th percentile of the scores obtained against a randomized
   sequence [reversed input, not complemented and masked using Tandem
   Repeats Finder with parameters 2 7 7 80 10 70 10 ([200]73)]. The
   library of consensus sequences was classified using PASTEC followed by
   manual curation ([201]74). To improve classification, remote homology
   detection was performed using HH-suite3 ([202]75). For the density plot
   of genes and TEs ([203]Fig. 1), we calculated the proportion of
   coverage of each feature in a 100-kb window with a 90-kb jump using
   Bedtools (v2.27.) ([204]76). These results were plotted in R (v3.5.3)
   ([205]77) using the package karyoploteR (v1.8.8) ([206]78)
   (ceratodon_genome_plots.R,
   [207]https://doi.org/10.5061/dryad.v41ns1rsm). To examine differences
   in enrichment between the autosomes, U, and V, we ran a pairwise MWU
   for multiple tests ([208]79, [209]80) using the sliding window
   densities (n[Auto] = 2736, n[U] = 1247, n[V] = 1229).

TF and regulator annotation

   Transcription-associated proteins (TAPs) comprise TFs (acting in a
   sequence-specific manner, typically by binding to cis-regulatory
   elements) and TRs (acting on chromatin or via protein-protein
   interaction). We classified all C. purpureus proteins into 122 families
   and subfamilies of TAPs by a domain-based rule set ([210]81, [211]82).
   We compared this genome-wide classification with relevant organisms.
   All proteins in which a domain was found are listed with their family
   assignment. In cases when the domain composition does not allow an
   unambiguous assignment, they are assigned no_family_found.

Gene expression and coexpression

   Gene expression and coexpression analyses were done using three
   male-female sibling pairs (n[isolates] = 6, three of each sex) at
   gametophore and protonemal stages (n[stages] = 2) in triplicate
   (n[replicates] = 3) (table S5; see Supplementary Materials and Methods
   for details on tissue conditions). Raw reads were filtered for
   contaminants and adapters removed using BBDuk (v38.00) (Bushnell,
   [212]http://bbtools.jgi.doe.gov). This included removing reads with 93%
   identity to human, mouse, dog, or cat or aligning to common microbial
   references. Further filtering removed reads with any “N’s,” an average