ABSTRACT

   Chinese serow (Capricornis milneedwardsii) is mainly distributed in the
   south of Yellow River in China, which has been listed as vulnerable by
   the International Union for Conservation of Nature (IUCN). However, the
   reference genome of serow has not been reported and its taxonomic
   status is still unclear. Here, we first constructed a high‐quality
   chromosome‐level reference genome of C. milneedwardsii using PacBio
   long HiFi reads combined with Hi‐C technology. The assembled genome was
   ~2.83 Gb in size, with a contig N50 of 100.96 Mb and scaffold N50 of
   112.75 Mb, which were anchored onto 24 chromosomes. Furthermore, we
   found that the Chinese serow was more closely related to muskox, which
   diverged from ~4.85 million years ago (Mya). Compared to the karyotype
   of goat (2n = 60), we found the Chinese serow (2n = 48) experienced six
   chromosome fusions, which resulted in the formation of six central
   centromere chromosomes. We also identified two positively selected
   genes (MYH6 and DCSTAMP) specific to Chinese serow, which were involved
   in ‘viral myocarditis’ and ‘Cardiac muscle contraction’. Interestingly,
   compared to other Caprinae animals, the MYH6 protein of Chinese serow
   occurred two mutations (E1520S and G1521S), which might be related to
   hypoxia tolerance. The high‐quality reference genome of
   C. milneedwardsii provides a valuable information for protection of
   serows and insights into its evolution.

   Keywords: Chinese serow, chromosomal evolution, Hi‐C, HiFi, hypoxia
   tolerance
     __________________________________________________________________

   We constructed a high‐quality chromosome‐level reference genome of
   C. milneedwardsii for the first time. The Chinese serow was more
   closely related to muskox, and the karyotype of serow (2n = 48)
   occurred six chromosome fusions. Interestingly, compared to other
   Caprinae species, the MYH6 protein of Chinese serow occurred two
   mutations (E1520S and G1521S) which were conserved in Cetaceans.

   graphic file with name ECE3-14-e70400-g004.jpg

1. Introduction

   The serow (Capricornis spp.) is a typical bovid herbivore in the
   tropical and subtropical regions of southeast Asia, belonging to the
   Caprinae subfamily (Meng et al. [44]2009), which has been listed as
   vulnerable by the IUCN (Chen et al. [45]2021). The Capricornis genus is
   comprised of six species: Chinese serow (C. milneedwardsii), Sumatran
   serow (C. sumatraensis), Formosan serow (C. swinhoei), Himalayan serow
   (C. thar), Japanese serow (C. crispus) and Red serow (C. rubidus)
   according to the Mammal Species of the World (3rd edition) (Wilson and
   Reeder [46]2005). Chinese serow, also known as Tenma, is mainly
   distributed in the south of Yellow River in China (Figure [47]1), which
   can be divided into five subspecies according to the physiological
   characteristics and geographical location (Wang [48]1990).

FIGURE 1.

   FIGURE 1
   [49]Open in a new tab

   Chinese serow in Qinling Mountains (Photographer: Naxun Zhao).

   Based on the morphology, characteristic of chromosomes or analysis of
   mitochondrial cytochrome b genes (Cytb) sequences, serow (2n = 48) most
   closely related to goral (2n = 50) (Jantarat et al. [50]2017). Compared
   to other Caprinae animals, serow was more closely related to goral
   based on the mitochondrial genome sequence (Dou, Zhang, and
   Feng [51]2016; Song et al. [52]2023). However, the mitochondrial
   phylogeny might be mixed with different species (Song et al. [53]2023).
   Therefore, the phylogenetic analysis of the whole‐genomic level was
   required to classify the taxonomic status of serow.

   In this study, we constructed a chromosome‐level reference genome of
   C. milneedwardsii using PacBio long HiFi reads combined with Hi‐C
   technology. Furthermore, the taxonomic status and chromosomal evolution
   of serow were analyzed. What's more, positive selection gene and rapid
   evolution gene were identified. These findings will be helpful to
   understand the evolution of Caprinae species.

2. Materials and Methods

2.1. Sample Collection and Ethics Statement

   The liver sample from a dead adult female Chinese serow collected in
   December 2022 at Louguantai, Zhouzhi, Xi'an City, Shaanxi Province,
   China was used for genome sequencing. Sample collection and the
   experimental procedures were carried out according to the guidelines of
   the China Council on Animal Care and approved by the Experimental
   Animal Management Committee (EAMC) of Northwest A&F University
   (Approval ID: XN2023‐0619).

2.2. Genome Sequencing and Assembly

   Obtaining HiFi reads using the PacBio Sequel II platform. First, the
   software HiFiAdapterFilt (v2.0.0) (Sim et al. [54]2022) was used with
   default parameters 44 bp and 97% match to remove the remaining adapter
   sequences from PacBio HiFi reads. Only 7 reads contaminated with
   adapters were removed. Second, the remaining HiFi reads were assembled
   de novo to the contig level using hifiasm (v0.19.5) with default
   parameters (Cheng et al. [55]2021). The clean Hi‐C reads were aligned
   to the contigs using Chromap (v0.2.4) with the additional parameter –
   remove‐pcr‐duplicates (Zhang et al. [56]2021). Third, the alignments
   sorted by name were used as input for YaHs (v1.2a.1) to perform genome
   scaffolding (Zhou, Mccarthy, and Durbin [57]2023). Finally,
   visualization and manual inspection of the Hi‐C interaction matrix were
   conducted in Juicebox (Durand et al. [58]2016). The TGS‐GapCloser
   (v1.2.1) (Xu et al. [59]2020), Pilon (v1.23) (Walker et al. [60]2014),
   BUSCO (v5.4.7) (Simao et al. [61]2015) and the mammalia_odb10 database
   were used to assess genome completeness. The distribution of GC content
   in the genome was determined using bedtools (v2.31.0) software with a
   window size of 1 Mb (Quinlan [62]2014). The genome resequencing was
   conducted using the Illumina NovaSeq6000 platform. The software GCE
   (v1.0.2) ([63]https://arxiv.org/abs/1308.2012v2) was used to estimate
   the genome size and heterozygosity. The estimation was based on 40×
   Illumina WGS reads using the k‐mer method with a k‐mer size of 17.

2.3. Repeats and Gene Annotation

   RepeatMasker (v4.1.5) ([64]https://www.repeatmasker.org) and repeat
   database (including Dfam v.3.7 and RepBase library October 26, 2018)
   were used to identify repeats in the serow genome. The software Liftoff
   (v1.6.3) (Shumate and Salzberg [65]2021) was utilized to map the
   annotation of the sheep genome (ARS‐UI_Ramb_v2.0_GCF_016772045.1,
   annotation‐source NCBI Ovis aries Annotation Release 104) onto the
   serow genome. The parameters were polish, copies, and flank 0.2. The
   input comprised the genome sequences of sheep and serow, along with the
   GFF‐formatted annotation of the sheep genome. Minimap2 (v2.26‐r1175)
   (Li [66]2018) was used to align the entire gene sequence, including
   exons and introns, to the target, resulting in the gene annotation file
   of the serow, from which protein‐coding genes were filtered.

   The Circos (v0.69–9) software (Krzywinski et al. [67]2009) was used to
   visualize the details of the serow genome. The outermost track
   represents chromosome lengths, followed by gene density, repetitive
   sequences, and GC content, each with a window size of 1 Mb. Centromere:
   50,000 bp was defined as the longest merging distance between the two
   sections, and the location of centromere on each chromosome was found
   with the interval span greater than 200,000 bp as the filter condition.
   Telomeres: Telomere positions on each chromosome were determined using
   a filtering criterion of telomeric repeat sequences extending at least
   500 bp from the chromosome ends. Additionally, we assessed the presence
   of telomeric repeat sequence (TTAGGG) within the terminal 20 kb of the
   autosomal telomeric sequences.

   Using the AGAT (v1.2.0) software to process annotation results, the
   agat_sp_keep_longest_isoform.pl. script was executed to extract the
   annotation file of the longest transcript sequences from the annotation
   file of the serow genome. The agat_sp_extract_sequences.pl. script was
   executed to obtain the protein sequences of the longest transcripts.
   CDSs (average CDSs length per gene) and exon average length were
   calculated from the longest transcription sequence annotation file of
   serow. According to the longest transcription sequence annotation file
   of serow, the intron sequence position is obtained by taking the
   difference set between gene sequence position and exon sequence, and
   then its average length is calculated.

2.4. Mitochondrial Genome Assembly

   The MitoHiFi (v2.2) software (Uliano‐Silva et al. [68]2023) was used to
   assemble the mitochondrial genome from contigs using the serow
   mitochondrial genome [69]NC_023457.1 as a reference. Mitochondrial
   sequences were retrieved from contigs using BLASTn, and sequences with
   alignment length > 80% of mitochondrial length and identical matches
   > 90% were identified as mitochondrial sequences. All 64 sequences
   identified were small contig (< 60 kb) that did not interact with other
   sequences. The mitochondrial genome was annotated with MitoFinder
   (v1.4) (Allio et al. [70]2020) and visualized by OGDRAW (Greiner,
   Lehwark, and Bock [71]2019).

2.5. Phylogenetic Analysis and Divergence Time Estimation

   The phylogenetic tree was constructed including 18 mitochondrial
   genomes: C. milneedwardsii (OR_551470), Ovis aries ([72]NC_001941.1),
   Ovis ammon ([73]KT781689.1), Ovis nivicola ([74]NC_039431.1), Capra
   sibirica ([75]NC_020626.1), Capra hircus ([76]MZ073671.1), Capra
   aegagrus ([77]NC_028161.1), Oreamnos americanus ([78]NC_020630.1),
   Budorcas taxicolor ([79]NC_013069.1), Ovibos moschatus
   ([80]NC_020631.1), Pantholops hodgsonii ([81]NC_007441.1), Bos taurus
   ([82]NC_006853.1), Hemitragus hylocrius (the mitochondrial genome was
   assembled based on the Illumina WGS sequencing data SRR11430235),
   Pseudois nayaur ([83]NC_020632.1), Ammotragus lervia ([84]NC_009510.1),
   Ovis orientalis ([85]NC_026063.1), Naemorhedus goral ([86]NC_021381.1)
   and Rupicapra rupicapra ([87]NC_020633.1). Mitochondrial genome was
   applied to MEGA (v11) (Tamura, Stecher, and Kumar [88]2021). The
   ClustalW was used for sequence alignment, and the Maximum Likelihood
   method was selected with the bootstrap value set to 100 to construct
   the mitochondrial phylogenetic tree.

   The genomes of Pantholops hodgsonii (GCF_000400835.1), Oreamnos
   americanus (GCA_009758055.1), Ovibos moschatus (GCA_021462335.1), Ovis
   ammon (GCA_028583565.1), Capra aegagrus (GCA_000978405.1), Capra hircus
   (GCA_015443085.1), C. milneedwardsii (serow.2 genome, GCA_032405125.2),
   Ovis aries (GCF_016772045.1), Capra sibirica (GCA_003182615.2), Ovis
   nivicola (GCA_903231385.1), Budorcas taxicolor (GCF_023091745.1),
   Hemitragus hylocrius (GCA_004026825.1), Pseudois nayaur
   (GCA_003182575.1), Ammotragus lervia (GCA_002201775.1), Ovis orientalis
   (GCA_014523465.1), and Rupicapra rupicapra (GCA_963981305.1) were
   separately aligned to the genome of Bos Taurus (GCF_002263795.1) using
   the LAST (v1409) (Kielbasa et al. [89]2011). The comparison results
   were then combined with Multiz (v11.2) (Blanchette et al. [90]2004) and
   filtered with Gblocks (v0.91b), retaining 4.31 million quadruple
   degenerate sites. A species tree was then constructed using RAxML‐NG
   (v1.2.0) (Kozlov et al. [91]2019) under the GTR + G model, with
   bootstrap replicates set to 100. The BASEML subroutine in the PAML
   (v4.10.7) package (Yang [92]2007) was used to calculate the neutral
   replacement rate (0.197217 + −0.000417). The calculated neutral
   replacement rate was used to calculate the divergence time for
   MCMCTREE, and the time calibration was performed on both nodes using
   the previously studied calibration time. The divergence between cattle
   and other animals is ‘> 18.3 < 28.5’ Mya, the divergence between goats
   and sheep is ‘> 3.9 < 8.1’ Mya (Yang [93]2007).

2.6. Chromosomal Evolution

   The LAST (version 1409) (Kielbasa et al. [94]2011) was used to align
   the goat genome (GCA_015443085.1) to the takin (GCF_023091745.1) and
   the assembled serow genome to the goat genome, respectively. Using
   MCScan (v1.3.6) (Tang et al. [95]2024), a fragment length is more than
   30,000 bp for filter, linear relation between visualization on two
   groups.

2.7. Heterozygosity Estimation

   The genome sequencing data was compared with the assembled genome using
   BWA‐MEM (v0.7.17). The SAMtools (v1.17) software (Li et al. [96]2009)
   was used to convert SAM files to BAM files. GATK (v4.1.8.1) software
   (Mckenna et al. [97]2010) was used to invoke and screen single
   nucleotide polymorphisms (SNPs). SNPs with mutation type 0/1 were
   screened out, and then the number of SNPs was compared with the length
   of the reference genome to obtain the heterozygosity of the genome.
   Heterozygosity of other species was obtained from previous research
   (Liu et al. [98]2021).

2.8. Positive Selection Gene, Rapid Evolution Gene Identification

   The CODEML subroutine in PAML was used to identify positive selection
   genes (PSGs) and rapid evolution genes (REGs) in the serow genome.
   Serow branches in phylogenetic trees were designated as foreground
   branches for independent positive selection analysis. The PSG was
   determined by the branch‐site model, where the alternative model
   allowed a positive selection of the site on the foreground branch. REG
   was tested with the branch model, where the alternative model allowed
   for different rates for different branches. Models allowing positive
   selection (dN/dS > 1) and empty models allowing neutral evolution and
   purification selection (dN/dS < 1) were compared using the likelihood
   ratio test (LRT). KOBAS (v3.0) software (Bu et al. [99]2021) and DAVID
   software were used for KEGG and GO pathway enrichment analysis of the
   identified positive selection genes and rapid evolution genes
   (p < 0.05).

2.9. Prediction of the Three‐Dimensional Structure of MYH6 Protein

   Download the amino acid sequence of the bovine MYH6 gene from NCBI, the
   software SMART (Simple Modular Architecture Research Tool)
   ([100]https://smart.embl‐heidelberg.de/) was used to predict the
   functional domain in the protein sequence. The nucleotide sequences of
   MYH6 gene were converted into amino acid sequences using the CODEML
   subroutine in PAML (v4.10.7) (Yang [101]2007). The converted amino acid
   sequences were subjected to multiple sequence alignment using the
   Clustal Omega (version 1.2.4) tool available in the EMBOSS software
   ([102]https://www.ebi.ac.uk/Tools/emboss/). Download the amino acid
   sequence of the human MYH6 gene from NCBI. Visualize the retrieved
   three‐dimensional structure from the database using PyMOL
   (version2.5.7) ([103]https://pymol.org/). Mutate the amino acids at
   positions 1520 and 1521 from E and G to S and S, respectively.
   Visualize the hydrogen bonds formed between the surrounding atoms and
   the two amino acids before and after the mutation. Utilize the software
   ProtScale ([104]https://web.expasy.org/protscale/) to analyze the
   hydrophobicity of the two positions before and after the mutation and
   their adjacent amino acids.

3. Results

3.1. Chromosome‐Level de novo Genome Assembly of Chinese serow

   To estimate the genome size of C. milneedwardsii, 120.10 Gb clean reads
   were used for k‐mer analysis. The genome size was estimated at 2.88 Gb
   with 0.43% heterozygosity (k = 17, Figure [105]S1). A total of 107.2 Gb
   HiFi reads (~37×) were generated by the PacBio Sequel II platform
   (Table [106]S1). We assembled a ~2.83 Gb reference genome with 240
   contigs and a contig N50 of 100.96 Mb using Hifiasm v0.19.5
   (Table [107]1).

TABLE 1.

   Assembly statistics of Chinese serow genome.
   Assembly                   Number/length
   Total assembly length (bp) 2,828,266,837
   Gap number                       6
   Number of contigs               165
   N50 contig length (bp)      100,955,055
   Contig L50                      11
   Number of scaffolds             158
   N50 scaffold length (bp)    112,748,084
   Scaffold L50                     9
   Chromosome number             23 + X
   GC content (%)                 43.07
   Protein‐coding genes          20,783
   [108]Open in a new tab

   To obtain a chromosome‐level assembly, 352.60 Gb Hi‐C reads (~120×)
   spanned to 252 scaffolds with a N50 of 112.75 Mb (Table [109]S2), which
   anchored onto 24 chromosomes (Figure [110]2A). The gap in Chr 7 was
   filled by TGS‐GapCloser v1.2.1. After deleting the mitochondrial genome
   sequences and polishing with pilon v1.23, the completeness of final
   assembled genome reached 95.90% with six gaps using
   cetartiodactyla_odb10 database in BUSCO v5.4.7 (Figure [111]2B,
   Table [112]S3). Compared to the references of other species, Chinese