Abstract

   Spiny head croaker (Collichthys lucidus), belonging to the family
   Sciaenidae, is a small economic fish with a main distribution in the
   coastal waters of Northwestern Pacific. Here, we constructed a
   nonredundant chromosome-level genome assembly of spiny head croaker and
   also made genome-wide investigations on genome evolution and gene
   families related to otolith development. A primary genome assembly of
   811.23 Mb, with a contig N50 of 74.92 kb, was generated by a
   combination of 49.12-Gb Illumina clean reads and 35.24 Gb of PacBio
   long reads. Contigs of this draft assembly were further anchored into
   chromosomes by integration with additional 185.33-Gb Hi-C data,
   resulting in a high-quality chromosome-level genome assembly of
   817.24 Mb, with an improved scaffold N50 of 26.58 Mb. Based on our
   phylogenetic analysis, we observed that C. lucidus is much closer to
   Larimichthys crocea than Miichthys miiuy. We also predicted that many
   gene families were significantly expanded (p-value <0.05) in spiny head
   croaker; among them, some are associated with “calcium signaling
   pathway” and potential “inner ear functions.” In addition, we
   identified some otolith-related genes (such as otol1a that encodes
   Otolin-1a) with critical deletions or mutations, suggesting possible
   molecular mechanisms for well-developed otoliths in the family
   Sciaenidae.

   Keywords: spiny head croaker, whole genome sequencing, chromosome-level
   genome assembly, otolith development, Sciaenidae

Introduction

   Spiny head croaker (Collichthys lucidus), belonging to the family
   Sciaenidae, is a small economic fish with a main distribution in the
   coastal waters of Northwestern Pacific ([39]Cheng et al., 2012a), from
   Philippines, China to Japan. With excellent properties and good meat
   quality, spiny head croaker has been favored by Chinese consumers with
   a high market value, whereas it has been overfished in the Yangtze
   estuary area of China ([40]Hu et al., 2015). Furthermore, Sciaenidae
   fishes are well known for their well-developed otoliths ([41]Xu et al.,
   2016), which are acellular crystalline mineral deposits in the inner
   ears of various teleost fishes ([42]Pracheil et al., 2019).

   Otoliths, mainly composed of calcium carbonate and organic matrix, play
   vital roles in sound sensing, balance, linear acceleration, and gravity
   in bony fishes ([43]Schulz-Mirbach et al., 2019). Moreover, otoliths
   are widely applied in fisheries sciences, such as evaluation of fish
   populations or population migration patterns, and are essential for
   paleoichthyological and archeological studies ([44]Barrett, 2019;
   [45]Barnett et al., 2020; [46]Heimbrand et al., 2020). Several critical
   genes, including Otoconin-90 (Oc90), secreted protein acidic and rich
   in cysteine (SPARC), SPARC-like1 (SPARCL1), otopetrin-1 (otop1),
   otolin-1 (otol1), and Otolithmatrixprotein-1 (OMP-1), have been
   identified to be related to otolith growth and formation in various
   vertebrates including bony fishes ([47]Hurle et al., 2003; [48]Murayama
   et al., 2005; [49]Kang et al., 2008; [50]Petko et al., 2008; [51]Xu et
   al., 2016b; [52]Hołubowicz et al., 2017). Nevertheless, a systematic
   screening of otolith-related genes in Sciaenidae species has not been
   reported yet.

   With the rapid development of genome sequencing technology and
   genome-based bioinformatics methods, studies on aquatic genomes and
   related applications, such as molecular breeding, drug development, new
   biomaterials, and DNA barcoding technology, have been accumulated
   ([53]Zhang et al., 2019; [54]Houston et al., 2020; [55]Li et al., 2020;
   [56]Long et al., 2020). Whole-genome sequencing (WGS) of about 90
   fishes has been published around the world by far ([57]Bian et al.,
   2019; [58]You et al., 2020). At present, genome studies on the
   Sciaenidae family have focused on the popular large yellow croaker
   (Larimichthys crocea) and its economic traits ([59]Wu et al., 2014;
   [60]Ao et al., 2015; [61]Gui et al., 2018; [62]Mu et al., 2018;
   [63]Chen et al., 2019), while there are only few reports on other
   Sciaenidae species such as spiny head croaker.

   Mitochondrial genome maps of spiny head croaker and candidate genes
   related to its sex determination have been examined by mitochondrial
   genome sequencing ([64]Cheng et al., 2012b), chromosome assembly
   ([65]Cai et al., 2019), and RNA sequencing ([66]Song et al., 2020).
   However, the genetic basis of well-developed otoliths in Sciaenidae
   species is still unknown. It is therefore necessary to explore genomic
   resources to gain insights into otolith development mechanisms and to
   accelerate genome-assisted improvements in biodiversity protection,
   breeding, and disease prevention of Sciaenidae species.

   In the past decades, Illumina short-read sequencing technology has been
   generally employed to assemble various fish genomes. However, with cost
   reduction of both short- and long-read sequencing technologies, more
   and more recent WGS projects have introduced PacBio long-read sequences
   in order to assemble high repetitive regions and improve assembly
   quality ([67]You et al., 2020). Here, we produced a nonredundant
   chromosome-level assembly of the spiny head croaker by combination of
   Illumina short reads, PacBio Single Molecule Real-Time (SMRT) long
   reads, high-throughput chromosome conformation capture (Hi-C) data, and
   transcriptome sequences. Moreover, we performed comparative genomics
   studies on candidate genes related to otolith development to figure out
   potential mechanisms for the well-developed otoliths in the family
   Sciaenidae.

Materials and Method

Sample Collection and Sequencing

   We extracted genomic DNAs from pooled muscle tissues of a wild female
   spiny head croaker and sequenced by using an Illumina HiSeq 2500
   sequencing platform (San Diego, CA, United States) and a PacBio Sequel
   sequencing platform (Menlo Park, CA, United States). The construction
   of DNA libraries (insert sizes of 500 and 800 bp for Illumina, and
   20 kb for PacBio) and subsequent sequencing were performed according to
   the standard protocols. In total, 52.48 Gb of raw Illumina data and
   five SMRT cells produced using the P6 polymerase/C4 chemistry,
   producing 35.24 Gb of PacBio long reads, were generated. After
   filtering by SOAPnuke (v.1.5.6; [68]Chen et al., 2017), we obtained
   49.12 Gb of Illumina clean data and 35.24 Gb of PacBio data for
   subsequent assembly.

   To acquire a chromosome-level assembly of the genome, genomic DNAs were
   fixed with formaldehyde and were sheared by a restriction enzyme (MboI)
   to build a Hi-C library, and then sequenced by an Illumina HiSeq X Ten
   platform. A total of 185.33 Gb of 150 PE Hi-C data were generated. All
   sequenced data generated in this study were deposited in the CNGB
   Nucleotide Sequence Archive under Program no. CNP0001197.

   We also extracted muscle RNA for transcriptome sequencing by using a
   HiSeq 2500 platform. Furthermore, to obtain the full-length transcript,
   the mixed RNA sample from 13 tissues was transcribed to generate
   full-length cDNA, and the SMRT bell library was constructed using the
   SMRT bell Template Prep Kit. The libraries were then prepared for
   sequencing on the PacBio Sequel sequencing platform.

Genome Assembly and Chromosome Assembly

   Firstly, we employed Kmerfreq
   ([69]https://github.com/fanagislab/kmerfreq) to estimate the genome
   size with 17-bp k-mers and applied GenomeScope (v1.0; [70]Vurture et
   al., 2017) to estimate genome heterozygosity. Subsequently, a hybrid
   genome assembly pipeline was employed to obtain genome assembly. Short
   Illumina reads were first assembled by using Platanus with “-m 300 -k
   27 -s 3” ([71]Kajitani et al., 2014), and DBG2OLC ([72]Ye et al., 2016)
   was performed to combine Platanus-generated contigs with PacBio reads
   to generate a hybrid contig assembly with default parameters. Pilon
   (v.1.225; ([73]Walker et al., 2014) was employed to polish the hybrid
   assembly. After then, redundancies of the primary assembly were removed
   by Redundans ([74]Pryszcz and Gabaldón, 2016) with “--identity 0.85
   --overlap 0.36.”

   We performed quality control of Hi-C raw reads and obtained valid
   Hi-C-connected reads by Juicer (v.1.5; [75]Durand et al., 2016). A 3D
   de novo assembly (3D-DNA, v.180922; [76]Dudchenko, et al., 2017)
   pipeline ([77]Dudchenko et al., 2017) was applied to anchor primary
   contigs into chromosome-level scaffolds. Completeness of the genome
   assembly was evaluated using by BUSCO v3.0 ([78]Simão et al., 2015)
   with “-l actinopterygii_odb9 -m genome -c 3 -sp zebrafish.”

De Novo Assembly of Transcriptomes

   We de novo assembled the RNA-seq reads using the Trinity assembler
   (v2.9.0; [79]Haas et al., 2013) and TGI clustering tool (TGICL;
   [80]Pertea et al., 2003). The PacBio ISO-Seq3 pipeline
   ([81]https://github.com/PacificBiosciences/IsoSeq) was used to obtain
   full-length non-chimeric (FLNC) transcripts via ccs, classify, cluster,
   and polish stage. FLNC was aligned with genome by minimap2.

Gene Prediction and Annotation

   Repetitive elements in the spiny head croaker genome were identified
   through a combination of homolog-based and de novo approaches. For the
   homolog-based method, RepeatMasker (v.4.0.7) ([82]Smit et al., 2019)
   and RepeatProteinMask (v.4.0.7; [83]Smit et al., 2019) were used to
   detect repeats by aligning against the Repbase database (v 21.0;
   [84]Bao et al., 2015). For the de novo method, LTRharvest
   ([85]Ellinghaus et al., 2008) was applied to predict full long terminal
   repeat (LTR) retrotransposons. RepeatModeler (v1.0.11; [86]Smit et al.,
   2019) was employed to build transposable element (TE) consensus
   sequences as a de novo TE library, and TRF (v.4.09; [87]Benson, 1999)
   was used to obtain tandem repetitive sequences. RepeatMasker was then
   used to discover and identify repetitive sequences with the combined
   library of the de novo TEs.

   Based on the repeat masked genome, we employed de novo, homology-based,
   and transcriptome-based prediction methods to annotate protein-coding
   genes in the assembled genome. Protein sequences of zebrafish (Danio
   rerio), three-spined stickleback (Gasterosteus aculeatus), Atlantic cod
   (Gadus morhua), channel catfish (Ictalurus punctatus), spotted gar
   (Lepisosteus oculatus), Nile tilapia (Oreochromis niloticus), fugu
   (Takifugu rubripes), downloaded from Ensembl ([88]Hunt et al., 2018),
   and large yellow croaker (L. crocea, GCF_000972845.2) from NCBI were
   aligned to the spiny head croaker genome by tBLASTn ([89]Kent, 2002)
   with “-e 1e-5.” Subsequently, GeneWise ([90]Birney et al., 2004) was
   used to predict gene structures from BLAST hits. Augustus (v3.3.1;
   [91]Stanke et al., 2006) was performed to predict de novo genes. We
   obtain 3,000 intact gene models generated from the homolog-based method
   randomly to train the parameters of AUGUSTUS then used AUGUSTUS to
   perform de novo prediction based on the repeat-masked genome with the
   training parameters. These gene sets that were predicted by different
   methods were integrated into a nonredundant gene set through the
   pipelines described in previous research ([92]Xiong et al., 2016).
   After that, the combined gene set was modified with transcriptome data
   through PASA (v2.3.3; [93]Haas et al., 2003).

   Gene functional annotation was performed based on consensus of sequence
   and domain. The protein sequences were aligned to NCBI Non-Redundant
   Protein Sequence (NR) databases, Kyoto Encyclopedia of Genes and
   Genomes (KEGG v89.0; [94]Kanehisa and Goto, 2000), SwissProt, and
   TrEMBL (Uniprot release 2020-06) ([95]Boeckmann et al., 2003) by BLASTp
   with “-e 1e-5.” The domains were searched and predicted by using
   InterProScan v5.11–55.0 ([96]Zdobnov and Apweiler, 2001) ([97]Jones et
   al., 2014) with publicly available databases including PANTHER
   ([98]Thomas et al., 2003), Pfam ([99]Bateman et al., 2004), PRINTS
   ([100]Attwood et al., 2000), ProDom ([101]Servant et al., 2002),
   PROSITE profiles ([102]Sigrist et al., 2010), and SMART ([103]Letunic
   et al., 2012). Gene ontology (GO) terms ([104]Ashburner et al., 2000)
   for each gene were predicted from the InterPro descriptions.

Genome Evolution and Gene Family Analysis

   In order to identify gene families in spiny head croaker, we collected
   protein sequences of the same species used for homologous annotation as
   well as miiuy croaker (Miichthys miiuy) and performed the TreeFam
   methodology ([105]Li et al., 2006) to obtain gene families of these
   species. We then used RaxML ([106]Stamatakis, 2006) to construct the
   phylogenetic tree by using the single copy orthologous gene families
   with the GTRGAMMA model.

   To identify the synteny between spiny head croaker and large yellow
   croaker, BLASTp was used to calculate pairwise similarities (e value <
   1e-5), and MCScanX package with default parameters was then used for
   classification. Then, JCVI was performed to generate visualization.

   A MCMCtree program in PAML (v4.9e; [107]Yang, 2007) was performed to
   estimate the divergence time between various species in the
   phylogenetic tree with the REV substitution model. Three calibration
   time points based on the TimeTree database
   ([108]http://www.timetree.org) were used as references (T. rubripes-G.