Abstract Nonrecombining sex chromosomes, like the mammalian Y, often lose genes and accumulate transposable elements, a process termed degeneration. The correlation between suppressed recombination and degeneration is clear in animal XY systems, but the absence of recombination is confounded with other asymmetries between the X and Y. In contrast, UV sex chromosomes, like those found in bryophytes, experience symmetrical population genetic conditions. Here, we generate nearly gapless female and male chromosome-scale reference genomes of the moss Ceratodon purpureus to test for degeneration in the bryophyte UV sex chromosomes. We show that the moss sex chromosomes evolved over 300 million years ago and expanded via two chromosomal fusions. Although the sex chromosomes exhibit weaker purifying selection than autosomes, we find that suppressed recombination alone is insufficient to drive degeneration. Instead, the U and V sex chromosomes harbor thousands of broadly expressed genes, including numerous key regulators of sexual development across land plants. INTRODUCTION Sex chromosomes arise when an ordinary pair of autosomes gains the capacity to determine sex ([80]1). A defining characteristic of sex chromosomes is suppressed recombination in the heterogametic sex. It is widely believed that this lack of meiotic recombination makes natural selection less effective, predisposing nonrecombining chromosomes, like the mammalian Y, to degeneration and gene loss ([81]2, [82]3). However, although some nonrecombining chromosomes rapidly degenerate, or are completely lost, the sex chromosomes in other groups remain homomorphic or expand ([83]2). This diversity of form and gene content suggests that the role of suppressed recombination in the long-term trajectory of sex chromosome evolution must be modulated by other processes related to the life history of the organism. Identifying these important processes requires comparative analyses across multiple eukaryotic lineages. Many organisms, including bryophytes, algae, and some fungi, have a haploid UV sex chromosome system, in which females inherit a nonrecombining U and males inherit a nonrecombining V ([84]4, [85]5). The sex-specific transmission pattern of both chromosomes means that factors that are confounded in XY or ZW systems, such as suppressed recombination, hemizygosity, and sex-limited inheritance, are independent on UV chromosomes ([86]4–[87]6). Many UV sex chromosome systems may be ancient ([88]5), providing ample time for degenerative processes to act. However, the structural complexity of sex chromosomes has precluded genomic analyses in UV systems. Here, we evaluate the relative roles of gene gain and degeneration in shaping the evolution of the bryophyte UV sex chromosomes using nearly-gapless, chromosome-scale female and male genomes of the moss Ceratodon purpureus. RESULTS Assembly of C. purpureus female and male genomes Ancestral-state reconstructions of dioecy suggest that sex chromosomes evolved early in the history of the extant mosses ([89]7). To reconstruct the evolutionary history of the bryophyte UV sex chromosomes, we assembled and annotated chromosome-scale genomes of GG1 (female) and R40 (male) C. purpureus isolates. Although the C. purpureus genome is relatively small, the sex chromosomes are large and have extensive repeat content, making them a challenge to assemble ([90]8), particularly with short-read technologies, which often do not span a whole repeat. We therefore used a combination of Illumina, bacterial artificial chromosomes (BACs), PacBio, and Dovetail Hi-C (figs. S1 and S2 and tables S1 to S4). The version 1.0 genome assembly of R40 comprises 358 Mb in 601 contigs (N50 1.4 Mb), with 98.3% of the assembled sequence in the largest 13 pseudomolecules, corresponding to the 13 chromosomes in its karyotype ([91]9). The version 1.0 GG1 assembly is 349.5 Mb in 558 contigs (N50 1.4 Mb), with 97.9% of assembled sequence in the largest 13 pseudomolecules. Using more than 1.5 billion RNA sequencing (RNA-seq) reads for each of the genome lines (GG1 and R40) and additional de novo assemblies of other C. purpureus isolates (table S5), we annotated 31,482 genes on the R40 assembly and 30,425 on GG1 [BUSCO v3.0 of 69% using Embryophyte; 96.7 and 96.4%, respectively using Eukaryote; values similar to the moss Physcomitrium patens ([92]10)]. Identifying the moss ancestral chromosome elements To examine the conservation of genome architecture, we performed synteny analyses between the two C. purpureus genomes and the P. patens genome. GG1 and R40 were collected from distant localities (Gross Gerungs, Austria and Rensselaer, New York, USA, respectively) ([93]11), and we found that the assemblies had numerous structural differences ([94]Fig. 1). In the self-synteny analysis, we found clear homeologous chromosome pairs resulting from an ancient whole-genome duplication (WGD) ([95]Fig. 1 and fig. S2), consistent with previous transcriptomic ([96]11, [97]12) and our own Ks-based analyses (fig. S2 and table S6). We also identified abundant synteny between the C. purpureus and P. patens chromosomes, which diverged over 200 million years (Ma) ago ([98]Fig. 1) ([99]13). This result demonstrates that the ancestral karyotype of most extant mosses consisted of seven chromosomes ([100]13), which we refer to as ancestral elements A to G ([101]Fig. 1), and suggests that major parts of the gene content of moss chromosomes are stable over hundreds of millions of years, similar to the “Muller Elements” in Drosophila ([102]14). Curiously, we could not detect the homeologs of the C. purpureus chromosomes 5 and 9 using synteny, an observation we return to below. Fig. 1. Chromosome architecture in C. purpureus. [103]Fig. 1 [104]Open in a new tab (A) Dot plot of syntenic orthogroup blastp hits between C. purpureus GG1 and R40 isolates, showing structural variation on autosomes and a lack of synteny across the sex chromosomes. (B) Self-synteny plot of C. purpureus R40 isolate showing homeologous chromosomes from a WGD. (C) Dot plot of syntenic orthogroup blastp hits between C. purpureus R40 and P. patens, highlighting the seven ancestral chromosomes that we refer to as the moss ancestral elements A to G. (D) Density plots across C. purpureus chromosomes (in megabases). Densities show the proportion of a 100-kb window (90-kb jump) of each feature. Local density peaks of RLC5 Copia elements (purple Copia peaks) on each chromosome represent candidate centromeric regions, similar to P. patens ([105]13). Suppressed recombination causes weak degeneration but not gene loss The major exception to the long-term genomic stability observed in C. purpureus was the sex chromosomes, which also share no obviously syntenic regions with each other or the autosomes ([106]Fig. 1). The sex chromosomes are ~30% of each genome (110.5 Mb on the R40 V, 112.2 Mb on the GG1 U; [107]Fig. 1), four times the size of the largest autosome. The size is largely attributable to an increase in transposable elements (TEs), which comprise 78.2 and 81.9% of the U and V, respectively, similar to the nonrecombining Y or W sex chromosomes in other systems ([108]15), but far more than the C. purpureus autosomes [mean (μ), 46.4%; Mann-Whitney U with Benjamini and Hochberg correction (MWU), autosomes to U or V P < 2 × 10^−16; [109]Fig. 1]. While some TEs have a homogeneous distribution across all chromosomes (e.g., Copia; μ: autosomes = 0.8%, U = 1.3%, V = 1.2%; MWU, all pairwise comparisons P > 0.09), the U and V chromosomes are enriched for very different classes of repeats compared to each other and the autosomes. For example, the U was enriched for hAT (μ: autosomes = 2.3%, U = 10.1%, V = 7.4%; MWU, all pairwise comparisons P < 1.5 × 10^−14) and the V was enriched in a previously undescribed superfamily of cut-and-paste DNA transposons, which we refer to as Lanisha elements (μ: autosomes = 1%, U = 1.2%, V = 5.8%; MWU, all pairwise comparisons P < 1 × 10^−4; [110]Fig. 1 and fig. S3). The distribution of repeats in C. purpureus and the physical proximity of the autosomes inferred from the Hi-C contact map (fig. S2) together highlight the enigmatic isolation of the sex chromosomes in the nucleus ([111]16). Unlike other nonrecombining sex chromosomes, neither the U nor the V shows signs of major degeneration beyond the increased TE density. Sex-linked genes used on average one more codon than autosomes [effective number of codons (ENC)], less frequently use optimal codons [frequency of optimal codon (fop)], have a loss of preferred GC bias in the third synonymous codon position (GC3s), and have a higher rate of protein evolution (dN/dS), all consistent with weaker selection (MWU, autosomes to U or V P < 6 × 10^−6 for all metrics; [112]Fig. 2). Although, notably, the U- and V-linked genes were not different from one another (MWU, ENC P = 0.8; fop P = 0.22; GC3s P = 0.18; dN/dS P = 0.73), suggesting that transmission through one sex or the other has no detectable effect on purifying selection. Consistent with this observation, the U and the V have 3450 and 3411 transcripts, respectively, representing ~12% of the C. purpureus gene content. This stands in stark contrast to the nonrecombining mammalian Y chromosome, or even other UV systems, which typically contain an order of magnitude fewer genes, at most ([113]17–[114]19). These observations indicate that although suppressed recombination decreases the efficacy of natural selection, alone, it is insufficient to drive gene loss on nonrecombining sex chromosomes ([115]20). Fig. 2. Molecular evolution of autosomal and sex-linked genes in C. purpureus. [116]Fig. 2 [117]Open in a new tab (A) Autosomal genes are significantly different from U- or V-linked genes in the ENC. (B) fop. (C) GC content of the third, synonymous codon (GC3s) and (D) protein evolution (dN/dS) (MWU, autosomes to U or V P < 6 × 10^−6 for all metrics, indicated by ***; numbers show means). However, U- and V-linked genes were not significantly different [MWU, ENC P = 0.8; fop P = 0.22; GC3s P = 0.18; dN/dS P = 0.73, indicated by NS (not significant)], suggesting weak but not significantly different degeneration on the U and V. For dN/dS, four U-linked genes and two V-linked genes fell above the given scale of the y axis. Moss sex chromosomes are ancient but evolutionarily dynamic The lack of degeneration means that thousands of genes can be used to reconstruct a detailed history of gene gain on the C. purpureus UV sex chromosomes. Critically, the times to the most recent common ancestor between orthologous genes on the U and V chromosomes allow us to estimate a minimum age for the sex chromosome system. In principle, the evolution of sex linkage should mimic the effects of a gene duplication, with identical U- and V-linked clades that coalesce at the node where recombination between them ceased (fig. S4). To identify these nodes, we used a phylogenomic approach with stringent inclusion criteria. We built 744 gene trees, 402 with U- and V-linked homologs. We found that most genes became sex-linked in the C. purpureus lineage, after the divergence from Syntrichia princeps (μ Ks = 0.16; [118]Fig. 3 and table S7). However, 13 U-V orthologous pairs diverged at the base of the Dicranidae (μ Ks = 0.85), and three pairs diverged before the split between the two diverse clades Bryidae and Dicranidae (μ Ks = 1.64). The most ancient U-V divergence (a Zinc finger Ran binding protein of unknown function) was before the split between Buxbaumia aphylla and the remaining Bryopsida, ~300 Ma ago [based on previous fossil-calibrated, relaxed-clock analyses ([119]21)] (Ks = 2.8; fig. S4). Fig. 3. Evolutionary history of moss and liverwort sex chromosomes. [120]Fig. 3 [121]Open in a new tab (A) Capture events of genes on moss and liverwort sex chromosomes. Numbers indicate how many extant genes were captured at the indicated branch based on the topology of the tree. The capture events in mosses can be traced back to three ancestral elements (A, B, and D), where the oldest sex-linked genes were from ancestral element A and homeologous chromosomes from B and D fused to the sex chromosomes. (B) Ks of one-to-one U-V orthologs plotted on U and V sex chromosomes of C. purpureus. Lines connect the U-V orthologs, where colors correspond to the ancestral elements in (A). The darker brown lines indicate genes with Ks ≤ 0.02, presumably representing the most recently captured genes, which highlights the rapid rearrangement of genes on the sex chromosomes. These data, in addition to synteny ([122]Fig. 1), also suggest a lack of a pseudo-autosomal region between the C. purpureus U and V. It is possible that the Zinc finger gene duplicated before the split between B. aphylla and the remaining Bryopsida, and these duplicates each were independently captured by the sex-determining locus in these two lineages. However, a more parsimonious explanation is that the origin of the bryophyte sex chromosome system predated the divergence between B. aphylla and the remaining Bryopsida. This observation provides support for the maximum parsimony ancestral state reconstruction of sexual system of McDaniel et al. ([123]7) but pushes back the origins of dioecy in the common ancestor of the Bryidae and Dicranidae to before the origin of the arthrodontous mosses ([124]Fig. 3). These data also provide an independent means to evaluate the inferred transition bias toward dioecy suggested by that analysis. A classic signature of gene capture on sex chromosomes is the presence of strata, where neighboring genes added in the same recombination suppression event have a similar Ks ([125]22). However, on the C. purpureus sex chromosomes, we found that Ks was not associated with gene order ([126]Fig. 3). Even genes with very low Ks, presumably from the most recent recombination suppression event, were found across the entirety of the U or V, meaning that gene order was shuffled soon after the evolution of sex linkage. To understand the mechanism by which the region of suppressed recombination acquires new genes, we combined inferences from phylogenomic analyses with the physical position of orthologs among the ancestral karyotypic elements. When we examined gene trees for the two most recent capture events, we found that the overwhelming majority are from ancestral elements D (~80% of C. purpureus–specific captures; table S7) and B (~92% of Dicranidae captures; table S7) indicating that the missing homeologous chromosomes to C. purpureus 5 and 9, respectively, had fused to the sex chromosomes ([127]Fig. 3), but the scrambling of gene order had rendered them undetectable using synteny alone. The mechanism by which the chromosome fusions occur in UV systems is not entirely clear. In other systems, the sex chromosomes have a pseudo-autosomal region (PAR) that is linked to the nonrecombining portion of the sex chromosome. Unlike the sex-limited region, the PAR pairs normally at meiosis, presumably to assure 1:1 segregation. The addition of new genes to both sex chromosome partners can proceed through the expansion of suppressed recombination into the PAR or the translocation of other genomic regions to the PAR. Curiously, we could find no trace of a PAR in the C. purpureus assemblies, nor in any of the unassembled scaffolds ([128]Figs. 1 and [129]3). We suspect that the PAR reported in C. purpureus genetic maps ([130]7, [131]8) reflects artifactual linkage generated by wide genetic crosses ([132]8, [133]23). Nevertheless, abundant genotyping of haploid spores from single diploid sporophytes strongly suggests that some PAR-independent mechanism enforces 1:1 segregation of the U and V sex chromosomes ([134]24). Thus, while small genomic regions could become independently incorporated into the U or the V, the translocation of whole chromosomes would likely result in homologous pairing between the neo-sex chromosome arms and ultimately the incorporation of the region into both the U and V. To extend the ancestral reconstruction to liverwort sex chromosomes, we generated gene trees using transcriptome data combined with previously identified sex-linked genes in Marchantia polymorpha ([135]18). Like in C. purpureus, we found no evidence of syntenic strata when we compared Ks between the U- and V-linked orthologs (table S8). We also found evidence of four liverwort-specific capture events, with the oldest diverging ~400 Ma ago, before the split of Marchantiidae and Pelliidae ([136]Fig. 3 and fig. S4) ([137]21). Our analyses show that most sex-linked genes in M. polymorpha (table S8), like two of the oldest genes in C. purpureus (table S7), have homologs from moss ancestral element A. This new insight leads to the remarkable suggestion that this element played a key role in sex determination early in the history of both lineages, ~500 Ma ago, making the Setaphyte sex chromosomes among the oldest known to date across Eukarya. The C. purpureus sex chromosomes harbor broadly expressed, conserved regulators of sexual development A key factor explaining the retention of transcripts on nonrecombining sex chromosomes is broad gene expression ([138]25, [139]26), which in plants includes the haploid phase. In transcriptomic data from multiple tissues, we found more than 1700 U- and V-linked genes expressed (mean count ≥ 1) (fig. S5 and tables S9 to S11), including essential components of the cytoskeleton (e.g., tubulin) and DNA repair complexes (e.g., RAD51). We found that the number of sex-biased autosomal genes (mean count ≥ 1, fold change ≥ 2, adjusted P ≤ 0.05) was far eclipsed by expressed sex-specific genes (i.e., those only on the U or V), suggesting that sex-linked loci contribute more to expression differences between the sexes than do autosomes (fig. S5). Furthermore, in contrast to data from gene-poor sex chromosome systems, we found that nearly all the genes in the female- and male-specific coexpression modules, including the hubs, were sex-linked (fig. S6 and table S12). The sex-specific gene expression networks are enriched for proteins with known reproductive functions across green plant lineages. For example, the male Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) terms show enrichment for microtubule-based processes, which play a role in sperm production in other systems (fig. S6 and tables S13 and S14) ([140]27, [141]28). We also found that both female and male coexpression modules are enriched for genes involved in circadian rhythm, like phytochrome, which are involved in flower development in Arabidopsis thaliana ([142]29). The male coexpression module also contained a V-specific ABC1 gene orthologous to a V-linked copy in M. polymorpha (fig. S7), and genes in this family are involved with pollen development in angiosperms ([143]30). The female coexpression module contains a U-specific RWP-RK transcription factor (TF) orthologous to M. polymorpha MpRKD, which is a conserved component of the egg development pathway across land plants and are mating-type loci in green algae (fig. S7) ([144]17, [145]31, [146]32). Moreover, the cis-acting sexual dimorphism switch MpFGMYB ([147]33), which promotes female development in M. polymorpha, as well as the P. patens CRINKLY4 (PpCR4) gene, which functions in archegonial neck development (fig. S7) ([148]34), have orthologous U- and V-linked copies in C. purpureus (fig. S7). Several other TFs or transcriptional regulators (TRs) are found in the sex-specific coexpression modules (e.g., V-linked R2R3-MYB) or are only found on the U or V (e.g., HD DDT, Med7, and SOH1; table S15), together suggesting that candidate regulators of sex-specific developmental processes are enriched on the C. purpureus UV sex chromosomes. In addition, 187 U- and/or V-linked genes are homologous to over 250 A. thaliana genes with reproductive roles (fig. S5 and table S16). It is, of course, difficult to prove shared gene function across these diverse taxa, particularly for genes found in large gene families, and neo-functionalization can lead to altered functions in specific lineages. Nevertheless, complementing mutants of these genes in hermaphroditic species, like A. thaliana or P. patens, with sex-linked homologs from C. purpureus is likely to provide a powerful means to interrogate the evolution and function of sex-limited gene regulatory networks. DISCUSSION Our analyses challenge the idea that suppressed recombination and sex-limited inheritance are sufficient to drive sex chromosome degeneration. Clearly, the lack of meiotic recombination both weakens purifying selection, which results in decreased codon bias and increased protein evolution, and facilitates massive structural variation and highly differentiated TE accumulation between the U and V. Like in other plants, haploid gene expression in C. purpureus apparently slows sex chromosome degeneration, even over millions of years of suppressed recombination ([149]25). However, unlike flowering plants, where hermaphroditism is the norm ([150]35), the antiquity of dioecy in bryophytes more closely mirrors the sexual systems in animals ([151]36, [152]37). Thus, the gene-rich C. purpureus sex chromosomes provide a powerful comparative tool for studying the long-term evolution of sex-limited gene regulatory networks that govern sexual differentiation. MATERIALS AND METHODS Isolate collection and tissue culture All C. purpureus tissue used in this study was isolated from a single spore ([153]24), from field-collected sporophytes (table S5) ([154]11, [155]38). In-depth methods for tissue generation for DNA and RNA, library preparation, and sequencing can be found in Supplementary Materials and Methods. Genome assemblies We sequenced C. purpureus (var. GG1 and var. R40) using a whole-genome shotgun sequencing strategy and standard sequencing protocols. Sequencing reads were collected using Illumina, PacBio, and Sanger platforms. Illumina, PacBio, and Sanger reads were sequenced at the Department of Energy Joint Genome Institute in Walnut Creek, California and the HudsonAlpha Institute in Huntsville, Alabama. Illumina reads were sequenced using the Illumina HiSeq 2000 and X10 platform, and the PacBio reads were sequenced using the SEQUEL I platform. Sanger BACs were sequenced using an ABI 3730XL capillary sequencer. For both GG1 and R40, one 400–base pair (bp) insert 2 × 150 Illumina fragment library (133.14× for GG1, 146.45× for R40) was sequenced along with one 2 × 150 Dovetail Hi-C library (252.86× GG1, 442.71× R40) (table S1). Before assembly, Illumina fragment reads were screened for phix contamination. Reads composed of >95% simple sequence were removed. Illumina reads <50 bp after trimming for adapter and quality (q < 20) were removed. For the PacBio sequencing, a total of eight chemistry 2.1 cells (10-hour movie time) were sequenced each for GG1 and R40 on Sequel 1 with a raw sequence yield of 39.82 Gb (GG1) and 46.24 Gb (R40) with a total coverage of 113.77× (GG1) and 132.11× (R40) (table S2). Last, a total of 1032 BAC clones sequenced with Illumina indexed libraries were used for patching the final chromosome gaps. Genome assembly and construction of pseudomolecule chromosomes Improved versions 1.0 of the C. purpureus (var. GG1 and var. R40) assemblies were generated by separately assembling the 4,195,510 PacBio GG1 reads (113.77× sequence coverage) and 5,238,148 PacBio reads R40 (132.11× sequence coverage) using the MECAT assembler ([156]39) and subsequently polished using QUIVER ([157]40). For GG1, this produced 637 scaffolds (637 contigs), with a contig N50 of 1.2 Mb, 475 scaffolds larger than 100 kb, and a total genome size of 347.1 Mb (table S3). For R40, this produced 731 scaffolds (731 contigs), with a contig N50 of 1.1 Mb, 497 scaffolds larger than 100 kb, and a total genome size of 361.3 Mb (table S3). Hi-C scaffolding using the JUICER pipeline ([158]41) was used to identify misjoins in the initial MECAT assembly. Misjoins were characterized as a discontinuity in the GG1 or R40 linkage group. A total of 73 misjoins were identified and resolved in GG1 and 64 in R40. The resulting broken contigs were then oriented, ordered, and joined together into 13 chromosomes (12 autosomal and 1 sex chromosome designated as “U” in the GG1 release and 12 autosomal and 1 sex chromosome designated as “V” in the R40 release) using both the map and the Hi-C data. A total of 579 joins were made in GG1 and 625 in R40 during this process. Each chromosome join is padded with 10,000 Ns. Significant telomeric sequence was identified using the (TTTAGGG)[n] repeat, and care was taken to make sure that it was properly oriented in the production assembly. The remaining scaffolds were screened against bacterial proteins, organelle sequences, and GenBank non-redundant database and removed if found to be a contaminant. For GG1, a set of 1032 BAC clones (107.8-Mb total sequence) sequenced with Illumina indexed libraries were used to patch remaining gaps in the chromosomes. Clones were aligned to the chromosomes using BLAT ([159]42), and clone contigs crossing gaps were used to form patches. A total of 35 gaps were patched. Last, homozygous single-nucleotide polymorphisms (SNPs) and insertions or deletions (INDELs) were corrected in the release consensus sequence using ~88× of Illumina reads (2 × 150, 400-bp insert) by aligning the reads using bwa mem ([160]43) and identifying homozygous SNPs and INDELs with the Genome Analysis Toolkit UnifiedGenotyper tool ([161]44). A total of 108 homozygous SNPs and 5291 homozygous INDELs in GG1 and 19 homozygous SNPs and 867 homozygous INDELs in R40 were corrected in the release. The final version 1.0 GG1 release contains 349.5 Mb of sequence (1.3% gap), consisting of 558 contigs with a contig N50 of 1.4 Mb and a total of 97.9% of assembled bases in chromosomes. The final version 1.0 R40 release contains 358.0 Mb of sequence (1.2% gap), consisting of 601 contigs with a contig N50 of 1.4 Mb and a total of 98.3% of assembled bases in chromosomes. Completeness of the euchromatic portion of the version 1.0 GG1 and 1.0 R40 assemblies was assessed by aligning an RNA-seq library (library code GNGZB for GG1 and GNGZC for R40). The aim of this analysis is to obtain a measure of completeness of the assembly, rather than a comprehensive examination of gene space. The transcripts were aligned to the assembly using GSNAP ([162]45). The alignments indicate that 96.88% of the GG1 RNA-seq reads aligned to the version 1.0 GG1 release and 97.01% of the R40 RNA-seq reads aligned to the version 1.0 R40 release. Construction of the scaffold assembly A total of 4,195,510 PacBio reads (113.77×) in GG1 and 5,238,148 PacBio reads (132.11×) in R40 were assembled using MECAT ([163]39) and formed the starting point of the version 1.0 release for each. The 310,662,272 Illumina sequence reads (133.14× sequence coverage) in GG1 and 353,932,084 Illumina sequence reads (146.45× sequence coverage) in R40 were used for fixing homozygous SNP/INDEL errors in the consensus. A total of 310,662,272 Hi-C reads (252.86× sequence coverage) in GG1 and 1,062,837,932 Hi-C reads (442.71× sequence coverage) in R40 were used for chromosome construction. Screening and final assembly release Scaffolds that were not anchored in a chromosome were classified into bins depending on sequence content. Contamination was identified using blastn against the National Center for Biotechnology Information (NCBI) nucleotide collection (NR/NT) and blastx using a set of known microbial proteins. In GG1, additional scaffolds were classified as repetitive (>95% masked with 24-mers that occur more than four times in the genome) (16 scaffolds, 482.8 kb), chloroplast (1 scaffold, 158.7 kb), and low quality (>50% unpolished bases after polishing, 3 scaffolds, 48.3 kb). In R40, additional scaffolds were classified as repetitive (>95% masked with 24-mers that occur more than four times in the genome) (12 scaffolds, 489.6 kb), chloroplast (1 scaffold, 50.2 kb), and low quality (>50% unpolished bases after polishing, 6 scaffolds, 236.8 kb). Resulting final statistics are shown in table S4. GG1 assessment of assembly accuracy A set of 17 finished contiguous Sanger BAC clones >100 kb were selected to assess the accuracy of the assembly. A range of variants were detected in the comparison of the BAC clones and the assembly. In 14 of the BAC clones, the alignments were of high quality (<0.05% base pair error) with an example being given in fig. S1. All dot plots were generated using Gepard ([164]46). The remaining three BACs indicate a higher error rate due mainly to their placement in more regions containing tandem repeats (fig. S1). The overall base pair error rate in the BAC clones is 0.016% (269 discrepant bp of 1,599,605 bp). Genome annotations Transcript assemblies were made from ~1.5 billion pairs of 2 × 150 stranded paired-end Illumina RNA-seq reads from C. purpureus GG1 and ~1.6 billion pairs from C. purpureus R40 using PERTRAN, which conducts genome-guided transcriptome short-read assembly via GSNAP ([165]47) and builds splice alignment graphs after alignment validation, realignment, and correction ([166]48), PERTRAN assemblies from G100m_X_G150f_Sporo reads on the C. purpureus GG1 or R40 genome, and filtered open reading frames (ORFs) from Trinity assemblies from stranded paired-end Illumina reads from additional C. purpureus cultivars (is Navarino, Magellanes, Chile; Durham, NC, USA; Otavalo, Ecuador; Renesselaer, NY, USA; and Storrs, CT, USA; table S5). 180,954 (GG1) and 194,414 (R40) transcript assemblies were constructed using the Program to Assemble Spliced Alignments (PASA) ([167]49) from RNA-seq transcript assemblies above and a bit of C. purpureus expressed sequence tags (ESTs). Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from A. thaliana ([168]50), soybean ([169]51), Setaria viridis ([170]52), grape ([171]53), Sphagnum magellanicum, and P. patens ([172]13), Selaginella moellendorffii ([173]54), and Chlamydomonas reinhardtii ([174]55), filtered Trinity assembly ORFs described above, high-confidence gene models from the first round of C. purpureus R40 gene call, UniProt Bryopsida, and Swiss-Prot proteomes to repeat-soft-masked C. purpureus GG1 genome using RepeatMasker ([175]56) with up to 2000-bp extension on both ends unless extending into another locus on the same strand. Repeat library consists of de novo repeats by RepeatModeler ([176]57) on C. purpureus GG1 genome and repeats in RepBase. Gene models were predicted by homology-based predictors, FGENESH+ ([177]58), FGENESH_EST (similar to FGENESH+, EST as splice site and intron input instead of protein/translated ORF), and EXONERATE ([178]59) and PASA assembly ORFs (in-house homology constrained ORF finder) and from AUGUSTUS via BRAKER1 ([179]60). The best-scored predictions for each locus are selected using multiple positive factors including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA. Improvement includes adding untranslated regions, splicing correction, and adding alternative transcripts. PASA-improved gene model proteins were subject to protein homology analysis to abovementioned proteomes to obtain Cscore and protein coverage. Cscore is a protein BLASTP score ratio to MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of protein aligned to the best of homologs. PASA-improved transcripts were selected on the basis of Cscore, protein coverage, EST coverage, and its coding sequence (CDS) overlapping with repeats. The transcripts were selected if its Cscore is larger than or equal to 0.5 and protein coverage is larger than or equal to 0.5, or it has EST coverage, but its CDS overlapping with repeats is less than 20%. For gene models whose CDS overlaps with repeats for more than 20%, its Cscore must be at least 0.9 and homology coverage at least 70% to be selected. The selected gene models were subject to Pfam analysis, and gene models whose protein is more than 30% in Pfam TE domains were removed and considered weak gene models. Incomplete gene models, low homology supported without fully transcriptome-supported gene models of short single exons (<300-bp CDS) with neither protein domain nor good expression gene models were manually filtered out. Synteny analysis within C. purpureus and between P. patens We ran the default GENESPACE pipeline ([180]48) with a minimum block size of 5 genes and a maximum gap/search radius of 15 genes. In short, GENESPACE runs Orthofinder ([181]61, [182]62) on synteny-constrained blastp hits. This offers higher stringency when exploring highly diverged genomes (or ancient WGDs) by removing high-scoring, but randomly distributed, blast hits. Ks plot analysis to identify the C. purpureus WGD WGDs were detected with conventional Ks plot analyses. We used the wgd pipeline ([183]63). An all-by-all BLASTP search ([184]64) was performed for the C. purpureus GG1 and R40 genomes as well as P. patens and M. polymorpha. Paralogs were clustered with MCL ([185]65). For each cluster, all pairwise Ks estimates were obtained from PAM ([186]66) with the GY94 model with F3x4 equilibrium codon frequencies ([187]67). Hierarchical clustering was used to reduce redundant comparisons and obtain node-averaged Ks estimates. This process was repeated for syntenic paralogs too, which were obtained from I-ADHoRe v3.0 with default settings ([188]68) based on all-by-all BLASTP results. Orthologous gene divergences used reciprocal best BLASTP hits between C. purpureus and P. patens. Peaks in Ks plots can be identified visually, but we also applied mixture models that were selected by the difference in BIC scores, such that a difference less than 3.2 is used as a stopping criterion. Mixture models were implemented with the bic.test.wgd function available on GitHub ([189]https://github.com/gtiley/Ks_plots). Mixture models can be problematic in their interpretation because of overfitting; therefore, we looked for peaks that were consistently detected across models and the maximum Ks value allowed ([190]69). When analyzing all paralogs, a single prominent peak was observable in C. purpureus with a mean between a Ks of 0.65 and 0.97 in GG1 and a Ks between 0.68 and 0.74 in R40 (table S6). The more consistent results in R40 imply that more paralogs from this WGD event have survived on the V chromosome compared to the U chromosome. This WGD postdates the divergence of C. purpureus and P. patens (fig. S2). This is determined by visual inspection but agrees with previous analyses of WGD in both C. purpureus and P. patens ([191]11–[192]13). The presence of a single WGD that occurred in C. purpureus following divergence from P. patens is supported by analyses of syntenic paralogs as well (fig. S2), which suggests slightly more recent WGD ages (table S6). However, analyses of syntenic paralogs from P. patens supported the presence of two WGDs following divergence from C. purpureus (table S6), similar to previous findings when using syntenic data ([193]13) compared to all paralogs from genomic or transcriptomic data ([194]12, [195]70). Ks plot analyses are provocative of older WGD events that predate the divergence of C. purpureus and P. patens. Notably, low numbers of syntenic paralogs are evident between Ks of 3.0 and 4.0; although, the same is true for M. polymorpha that putatively has no history of ancient WGD. Any identifiable peaks in Ks plot analyses are too speculative given the lack of evidence from mixture models and nor does their existence affect our proposed model of karyotype evolution. It should be noted though that analyses of gene trees that reconcile duplication and loss events onto a species tree have implied a shared large-scale duplication event shared by C. purpureus and P. patens [“B3” ([196]12)] and an even older event shared by all mosses [“B2” ([197]12)]. Testing these ancient hypotheses is beyond the scope of Ks plot analyses, even with syntenic data. Rather, macrosyntenic evidence from more moss species, such as Sphagnum fallax, will be needed to identify the presence of expected syntenic ratios among genes, similar to the identifiable 1:4 ratios between C. purpureus and P. patens investigated here. TE annotation We combined the R40 assembly (autosomes and V) with the U sex chromosome assembled from GG1 to run de novo repeat detection using the TEdenovo pipeline from the REPET package (v2.4) ([198]71). Parameters were set to consider repeats with at least five copies. We obtained a library of 4699 consensus sequences that was filtered to keep only those that are found at least once as a full-length copy in the combined assembly, and we retained 2523 of them. This library of consensus sequences was then used as digital probe for whole-genome annotation by the TEannot ([199]72) pipeline from the REPET package v2.4. Threshold annotation scores were determined for each consensus as the 99th percentile of the scores obtained against a randomized sequence [reversed input, not complemented and masked using Tandem Repeats Finder with parameters 2 7 7 80 10 70 10 ([200]73)]. The library of consensus sequences was classified using PASTEC followed by manual curation ([201]74). To improve classification, remote homology detection was performed using HH-suite3 ([202]75). For the density plot of genes and TEs ([203]Fig. 1), we calculated the proportion of coverage of each feature in a 100-kb window with a 90-kb jump using Bedtools (v2.27.) ([204]76). These results were plotted in R (v3.5.3) ([205]77) using the package karyoploteR (v1.8.8) ([206]78) (ceratodon_genome_plots.R, [207]https://doi.org/10.5061/dryad.v41ns1rsm). To examine differences in enrichment between the autosomes, U, and V, we ran a pairwise MWU for multiple tests ([208]79, [209]80) using the sliding window densities (n[Auto] = 2736, n[U] = 1247, n[V] = 1229). TF and regulator annotation Transcription-associated proteins (TAPs) comprise TFs (acting in a sequence-specific manner, typically by binding to cis-regulatory elements) and TRs (acting on chromatin or via protein-protein interaction). We classified all C. purpureus proteins into 122 families and subfamilies of TAPs by a domain-based rule set ([210]81, [211]82). We compared this genome-wide classification with relevant organisms. All proteins in which a domain was found are listed with their family assignment. In cases when the domain composition does not allow an unambiguous assignment, they are assigned no_family_found. Gene expression and coexpression Gene expression and coexpression analyses were done using three male-female sibling pairs (n[isolates] = 6, three of each sex) at gametophore and protonemal stages (n[stages] = 2) in triplicate (n[replicates] = 3) (table S5; see Supplementary Materials and Methods for details on tissue conditions). Raw reads were filtered for contaminants and adapters removed using BBDuk (v38.00) (Bushnell, [212]http://bbtools.jgi.doe.gov). This included removing reads with 93% identity to human, mouse, dog, or cat or aligning to common microbial references. Further filtering removed reads with any “N’s,” an average