Abstract

   Yellowtail kingfish (Seriola lalandi) is a pelagic marine piscivore
   with a circumglobal distribution. It is particularly suitable for open
   ocean aquaculture owing to its large body size, fast swimming, rapid
   growth, and high economic value. A high-precision genome is of great
   significance for future genetic breeding research and large-scale
   aquaculture in the open ocean. PacBio, Illumina, and Hi-C data were
   combined to assemble chromosome-level reference genome with the size of
   648.34 Mb (contig N50: 28.52 Mb). 175 contigs was anchored onto 24
   chromosomes with lengths ranging from 12.28 to 34.59 Mb, and 99.79% of
   the whole genome sequence was covered. The BUSCOs of genome and gene
   were 94.20 and 95.70%, respectively. Gene families associated with
   adaptive behaviors, such as olfactory receptors and HSP70 gene
   families, expanded in the genome of S. lalandi. An analysis of
   selection pressure revealed 652 fast-evolving genes, among which mkxb,
   popdc2, dlx6, and ifitm5 may be related to rapid growth traits. The
   data generated in this study provide a valuable resource for
   understanding the genetic basis of S. lalandi traits.

   Keywords: Seriola lalandi, genome, adaptation, rapid growth,
   aquaculture

Introduction

   To develop environmentally friendly and economically sustainable
   aquaculture, it is necessary to understand the genetic basis of traits
   that currently limit/enhance development of domestic aquaculture
   ([48]Rondeau et al., 2013). Genetic resources have been developed and
   widely used in agriculture and animal husbandry for decades, but only
   recently have they been used in selected aquaculture species ([49]Ozaki
   et al., 2013; [50]Dunham et al., 2014). There is still limited
   information on genetic variation on commercially important traits
   ([51]Peterson et al., 2020). The methods used to develop these
   resources offer the best possibilities for genetic improvement or
   culture practices ([52]Sodeland et al., 2013). Third-Generation
   Sequencing (TGS) has improved this area of research through high
   quality assemblies and decreasing costs, and this has enabled
   development of genetic resources for a greater number of species
   ([53]Huete-Pérez and Quezada, 2013; [54]Lee et al., 2016; [55]Lv et
   al., 2020).

   Yellowtail kingfish (Seriola lalandi) is an excellent marine economic
   fish. It has a number of beneficial traits for open ocean aquaculture
   systems, including large body size, rapid growth, and high-quality
   flesh ([56]Orellana et al., 2014; [57]Sanchís-Benlloch et al., 2017).
   Similar in taste to tuna or mackerel, yellowtail kingfish have a large
   market worldwide and are a popular fish used in sushi ([58]Purcell et
   al., 2015). These make them a good candidate for aquaculture. Since the
   1990s, extensive research in Japan has focused on artificial breeding
   and breeding technology for S. lalandi ([59]Sano, 1998). In China,
   aquaculture of S. lalandi began in 2001 ([60]Jiang et al., 2001), along
   with biological research, including studies of embryogenesis, seedling
   cultivation, and effects of salinity stress on growth ([61]Shi et al.,
   2019; [62]Xu et al., 2019; [63]Liu A et al., 2021).

   Here, we report a chromosomal-level genome assembly of S. lalandi. Our
   evolutionary and comparative genomic analysis provide insights into the
   adaptability of the species to the external environment. Furthermore,
   the genome analysis provide a valuable resource for further studies of
   the genetic basis of traits of S. lalandi.

Value of the Data

   This is the first chromosomal-level genome assembly in Seriola genus.
   It could be a valuable resource to conduct a comparative analysis among
   the species in the genome of the Seriola genus and for further studies
   of the genetic basis of traits of S. lalandi.

Materials and Methods

Sampling and Sequencing

   Yellowtail kingfish specimens were collected from Dalian Futai Marine
   Products Farming Co., Ltd. (Dalian, China). Total genomic DNA of a male
   fish muscle sample was extracted using the QIAamp DNA Mini Kit (QIAGEN,
   Hilden, Germany) following the manufacturer’s protocols. We constructed
   two paired-end libraries (insert sizes of 200 and 500 bp) following the
   manufacturer’s protocol (Chromium Genome v1, PN120229). The libraries
   were sequenced on the BGISEQ-500 platform to obtain PE 2 × 150 bp
   reads. The extracted DNAs were also used to construct a 20 kb library
   following the PacBio protocol (Pacific Biosciences, Menlo Park, CA,
   United States). The libraries were then sequenced on the PacBio Sequel
   platform. We obtained 48.74 and 106.76 Gb of raw sequence data using
   the BGISEQ-500 and PacBio platforms, respectively ([64]Supplementary
   Table 1).

   To construct chromosome-level assemblies, the Hi-C technique was used.
   A Hi-C library was prepared following the strategy described by Rao et
   al. ([65]Rao et al., 2014) using blood samples with an ∼300 bp insert
   size. Using the BGISEQ-500 platform to sequence the Hi-C library, we
   obtained 87.60 Gb of raw Hi-C data ([66]Supplementary Table 1).

   Four tissues (brain, pituitary, liver, and muscle) were collected for
   RNA sequencing. RNA from each tissue was extracted and treated with
   DNase I (TAKARA, Kusatsu, Japan) to remove genomic DNA. For each
   tissue, a paired-end RNA-sequencing library was constructed with an
   insert size of 300 bp and then sequenced on the Illumina HiSeq 2,500
   platform to generate PE 2 × 150 bp. One muscle specimen was also used
   to construct an Iso-Seq library and then sequenced on the PacBio Sequel
   platform. In total, we obtained 307.14 and 26.89 Gb of raw sequence
   data using the Illumina HiSeq 2,500 and PacBio platforms, respectively
   ([67]Supplementary Table 1).

Genome Assembly, Chromosome Anchoring, and Genome Annotation

   Before genome assembly, we estimated the genome by a k-mer analysis
   using Jellyfish v2.2.6 ([68]Marçais and Kingsford, 2011). For this, a
   series of k-mers (17, 19, and 21) were extracted from the BGISEQ-500
   sequencing data and the frequency of each kmer was calculated. The
   heterozygosity rate was estimated using 17-mers using GenomeScope
   v2.0.0 ([69]Supplementary Figure 1). Considering the C-value (0.7) from
   the Animal Genome Size Database, the estimated genome size of S.
   lalandi was 684.60 Mb.

   Canu v1.8 was used for the self-correction of long reads sequenced with
   the PacBio Sequel platform. Then, the corrected reads were assembled
   using wtdbg2 v2.5 (options: -x rs -g 750 m) ([70]Ruan and Li, 2020).
   Pilon v1.23 ([71]Walker et al., 2014) was used to polish contigs with
   short reads by three rounds of alignment. The Hi-C short reads were
   aligned to the scafiolds using Juicer ([72]Durand et al., 2016) and
   anchoring was performed using 3D-DNA v180419 ([73]Dudchenko et al.,
   2017). We finally used Juicebox Assembly Tools v1.9.9 ([74]Durand et
   al., 2016) to correct the connections. The completeness of the final
   assembly was assessed using BUSCO v.4.0 ([75]Simão et al., 2015).

   Both homology-based and de novo predictions were used to annotate
   repetitive sequences. Transposable elements were identified using
   RepeatMasker v4.0.7 ([76]http://www.repeatmasker.org) and
   RepeatProteinMask v1.36 with Repbase v17.01 ([77]Bao et al., 2015). A
   de novo transposable element library was constructed using
   RepeatModeler v1.0.11 ([78]http://www.
   repeatmasker.org/RepeatModeler.html) and was then used to predict
   repeats using RepeatMasker.

   To annotate gene structures, we used homology-based prediction,
   transcriptome-based prediction, and de novo prediction. For
   homology-based annotation, the protein sequences of eight teleost
   species downloaded from NCBI, including Seriola lalandi dorsalis,
   Seriola dumerili, Seriola quinqueradiata, Seriola rivoliana, Echeneis
   naucrates, Oryzias latipes, Danio rerio, and Takifugu rubripes, were
   aligned to the genome assembly by BLAT v3.6 ([79]Kent, 2002), and then
   GENEWISE v2.4.0 ([80]Birney et al., 2004) was used to predict gene
   structures. For next-generation RNA-sequencing annotation, data were
   aligned to the genome assembly using HISAT2 v2.1.0 ([81]Kim et al.,
   2015) and the alignments were fed to StringTie v1.3.5 ([82]Pertea et
   al., 2015) to assemble the transcriptome. TransDecoder v5.0.2
   ([83]https://github.com/TransDecoder/TransDecoder/) was used to
   predicate ORFs and identify candidate gene structures. For
   third-generation RNA-sequencing annotation, long-read RNA-seq (PacBio
   Iso-Seq) transcripts were obtained by removing the redundant sequences
   using cd-hit-est v4.8.1 ([84]Li and Godzik, 2006). Then, the
   non-redundant transcripts were mapped to the genome by BLAT and
   assembled using PASA v2.0.2
   ([85]https://github.com/PASApipeline/PASApipeline/). For de novo
   prediction, the gene structures were analyzed on the repeat-masked
   genome assembly using AUGUSTUS v2.5.5 ([86]Stanke et al., 2006),
   GlimmerHMM v3.0.4 ([87]Allen et al., 2006), and GENSCAN ([88]Burge and
   Karlin, 1998). Finally, genes predicted from the above methods were
   merged to obtain a consensus gene set using Evidence Modeler (EVM). For
   the functional annotation of the gene sets, the protein sequences of
   these genes were aligned against sequences in public protein databases,
   including, NR, KEGG, SwissProt, GO, InterPro, and Trembl, to identify
   homologues using Blastp v2.2.26 with an E-value cutoff of 1e-5.

Phylogenetic Analysis and Gene Family Expansion

   To determine single-copy genes of S. lalandi and other species (S.
   dumerili, S. quinqueradiata, S. rivoliana, E. naucrates, O. latipes, D.
   rerio, T. rubripes, Larimichthys crocea, Oreochromis niloticus, and
   Caranx melampygus), the TreeFam pipeline ([89]Li et al., 2006) was
   used. Before generating the alignment, the longest transcript of each
   gene was selected and protein sequences shorter than 50 amino acids
   were filtered out. Then, Blastp searches were performed for all protein
   sequences with an E-value cut-off of 1e-5, and fragmented alignments
   were merged using SOLAR. Hcluster was used to filter segments, group
   genes, and determine single-copy orthologue families. The phylogenetic
   tree was inferred using multiple alignments from the single-copy genes
   using RaxML-ng v0.9.0 ([90]Kozlov et al., 2019) under the
   site-heterogeneous GTR + G4 model with maximum likelihood estimation
   (ML).

   An ultrametric tree was inferred using r8s v1.71 with fossil records
   from the TimeTree website ([91]http://www.timetree.org) for
   calibration. An MCMCTREE analysis implemented in PAML v4.5 ([92]Yang,
   1997) was employed to estimate divergence times. CAFÉ v5.0 ([93]De Bie
   et al., 2006) was used to assess gene family size dynamics, and
   families with p < 0.05 showed significant expansion or contraction. GO
   and KEGG pathway enrichment analysis were used to analyze the expanded
   and contracted genes.

Positive Selection Analysis

   To identify positively selected genes (PSGs), we re-determined
   single-copy orthologues shared among five species (E. naucrates, T.
   rubripes, O. latipes, D. rerio, and S. lalandi) and constructed a
   phylogenetic tree. Based on the new phylogenetic tree and single-copy
   genes, we estimated the rate ratio (ω) of non-synonymous to synonymous
   nucleotide substitutions using CodeML (PAML package) to examine
   selective constraint. After obtaining high-quality alignments using
   prank v.100802 ([94]Löytynoja and Goldman, 2010), Gblocks v0.91b
   ([95]Castresana, 2000) was used to eliminate poorly aligned positions
   and divergent regions. Finally, the signature of positive selection (d
   [N ]/d [S ]> 1) was identified using the PAML branch site model. GO and
   KEGG pathway enrichment analysis were used to evaluate PSGs.

Results and Discussion

Genome Features

   To generate a high-quality reference genome, we combined PacBio,
   Illumina, and Hi-C data ([96]Supplementary Table 1). PacBio CLRs with
   coverage of 165 × were used for genome assembly. The draft assembly was
   648.34 Mb, with 277 contigs, a contig N50 of 28.52 Mb, and a GC content
   of 40.79% ([97]Supplementary Table 2). Using ∼52 Gb (∼87×) of valid
   Hi-C data, we anchored 175 contigs onto 24 chromosomes ([98]Figure 1A,
   [99]Supplementary Figure 2) ([100]Shi et al., 2017). The lengths of the
   24 chromosomes ranged from 12.28 to 34.59 Mb, and 99.79% of the whole
   genome sequence was covered ([101]Supplementary Table 3). To evaluate
   the completeness of the assembly, the BUSCO database
   (actinopterygii_odb10) and RNA-seq data were used. The genome contained
   94.20% complete BUSCOs and the average mapping rate of transcriptome
   data was 96.30% ([102]Supplementary Table 4). The published Trachinotus
   ovatus chromosome-level genome was used to validate the accuracy of the
   assembly of the chromosomes ([103]Zhang et al., 2019); 567.01 MB
   synteny blocks (each synteny block > 500 bp) were consistent with the
   assembled chromosomes ([104]Figure 1B).

FIGURE 1.

   [105]FIGURE 1
   [106]Open in a new tab

   Genome assembly and comparison. (A) Circos graph of genome statistics.
   Genomic features. From outer to inner circles: 1, represents
   chromosomes; 2, distribution of DNA transposons; 3, distribution of
   retrotransposons; 4, GC content; 5, gene distribution density; 6, each
   line joins paralogous genes at different chromosomes. 2–5 are drawn
   with 500 Kb sliding windows. (B) Genome comparison between S. lalandi
   and T. ovatus. The S. lalandi chromosomes are on the left, and the T.
   ovatus chromosomes are on the left.

   Repetitive elements comprised 22.46% of the S. lalandi genome, similar
   to the estimate in the T. ovatus genome (20.25%, 655 Mb) ([107]Zhang et
   al., 2019). The most abundant transposable elements (TEs) were DNA
   transposons (11.51%), followed by long terminal repeats (LTRs, 4.93%)
   and long interspersed elements (LINEs, 3.85%) ([108]Supplementary
   Figure 4). We integrated de novo, homology-based and
   transcriptome-based methods to predict a protein-coding gene set
   comprising 22,674 genes ([109]Supplementary Table 5), and which 20,568
   (90.71%) matched entries in a public database ([110]Supplementary Table
   6). We identified 95.70% complete BUSCOs from 22,674 protein-coding
   genes.

Phylogenetic Relationships and Genomic Comparison

   We constructed a phylogenetic tree of S. lalandi and 10 teleost fish
   (S. dumerili, S. quinqueradiata, S. rivoliana, E. naucrates, O.
   latipes, D. rerio, T. rubripes, L. crocea, O. niloticus, and C.
   melampygus) based on 5,067 single-copy genes ([111]Figure 2A,
   [112]Supplementary Table 7). According to the phylogeny and the fossil
   record of teleosts, we dated the divergence of Seriola from the other
   teleost species to approximately 72.6 million years ago ([113]Figure
   2A).

FIGURE 2.

   [114]FIGURE 2
   [115]Open in a new tab

   Genome evolution analysis. (A) Phylogenetic tree of 11 teleost genomes,
   which was constructed using 5,067 single copy orthologous genes. The
   black numbers on the branches indicate the estimated diverge times in
   millions of years ago, and the blue and red numbers represent the
   expanded and contracted gene families. The different types of
   orthologous relationships are shown on the right. (B) The enrichment
   analysis of 148 positively selected genes detected in S. lalandi
   genome.

   We detected 56 significantly expanded and 1,073 significantly
   contracted gene families (p < 0.05) in S. lalandi ([116]Figure 2A).
   Compared with teleost fish except of Seriola, the HSP70 family with 19
   HSP70 genes was expanded ([117]Supplementary Table 8). We found five
   hspa12 genes, including hspa12a, hspa12b, hspa12l-1, hspa12l-2, and
   hspa12l-3, which was more than observed in E. naucrates (2), O. latipes
   (2), D. rerio (2), T. rubripes (2), C. melampygus (3), and L. crocea
   (3) ([118]Liu et al., 2019). In S. lalandi, there were three hspa12l
   gene copies. HSP70 is a well-known stress protein ([119]Clark and Peck,
   2009), and the expansion of the HSP70 family in S. lalandi may
   contribute to its adaptation to changes in the aquatic environment.

   Yellowtail kingfish is a migratory marine fish with high olfactory
   sensitivity ([120]Martínez-Montaño et al., 2016). We identified 147
   olfactory receptor (OR)-like genes from the S. lalandi genome,
   including subfamily "Delta" (68), "Eta" (49), "Zeta" (12), "Epsilon"
   (9), "Beta" (6), "Thet" (2), and "Kappa" (1) ([121]Supplementary Table
   9). The expanded subfamilies "Delta" and "Epsilon" are important for
   the perception of water-soluble odorants ([122]Cong et al., 2019). Most
   teleosts possess one or two “Beta” OR genes, which are important for
   detecting both water-soluble and airborne odorants ([123]Liu H et al.,
   2021). However, subfamily "Beta" of olfactory receptor was expanded in
   S. lalandi. These expansions may contribute to the olfactory detection
   ability of the species, which could be useful for feeding and migration
   ([124]Bett and Hinch, 2016).

Fast-Evolving Genes in Yellowtail Kingfish

   PSGs are often associated with adaptive evolution and may contribute to
   new or improved functions. To understand the selective pressure
   operating on S. lalandi, we compared the orthologues of five teleost
   species (E. naucrates, T. rubripes, O. latipes, D. rerio, and S.
   lalandi) and identified 652 fast-evolving genes, including 148 PSGs (d
   [N ]/d [S ]> 1) and 504 genes that contain positively selected sites in
   S. lalandi ([125]Supplementary Table 10). Consistent with the large
   body size and fast swimming ability, an enrichment analysis revealed
   that the PSGs were involved in striated muscle tissue development
   (GO:0014706), regulation of actin cytoskeleton (dre04810), and fatty
   acid metabolism (dre01212) ([126]Figure 2B).

   Muscle tissue development is associated with the growth rate, which is
   a major economic trait in animal production. Several genes involved in
   muscle tissue development (klf2a, klhl41b, cdk9, ndrg4, mkxb, and
   popdc2) showed rapid evolution in S. lalandi and likely contribute to
   the rapid growth of the species ([127]Supplementary Table 10). Fast
   growing muscles also require increased bone support. Two genes, dlx6
   and ifitm5, were involved in skeletal system development and promote
   bone formation to support the large body ([128]Supplementary Table 10).
   Based on the strong muscle and skeletal systems, muscle
   contraction-related genes (arhgef12b, ramp2, tnnt2a, tnni1a, cald1a,
   and tnnt2a) with positively selected sites may provide support for fast
   swimming ([129]Supplementary Table 10). Furthermore, fatty acid
   metabolism-related fast-evolving genes (hsd17b12b, acadm, mecr, lipg,
   and hao1) also contributed to energy consumption and growth
   ([130]Supplementary Table 10). Moreover, some genes with positively
   selected sites were associated with growth (ficn, tgfbr3, igf1ra,
   gpc1b, rnf11, and tgfbrap1) ([131]Supplementary Table 10).

   We identified other fast-evolving genes, such as the pheromone response
   gene ora2, ear development gene ddt, and sensory perception gene ppef1,
   with potential roles in the perception of the external environment
   ([132]Supplementary Table 10). Fast-evolving genes involved in nervous
   system development and the regulation of neurotransmitter secretion
   (rab33a, fstl5, and cplx2) provided a tissue basis for sensitive
   sensory systems ([133]Supplementary Table 10). The fast-evolving genes
   ldha and rnf152 may contribute to adaptation to adverse environmental
   conditions, such as hypoxia and starvation ([134]Supplementary Table
   10).

Conclusions

   We sequenced and assembled the genome of S. lalandi using Illumina
   shotgun, PacBio SMRT, and Hi-C data, providing the first
   chromosome-level genome assembly for the genus Seriola. Basing on
   multiple annotation strategies, we obtained 22,674 protein-coding genes
   with minimal redundancy. Further genomic analysis revealed gene
   families associated with expansions of HSP70 and olfactory receptor
   gene families, and the rapid evolution of muscle and skeletal system
   development genes, providing insight into the genetic basis underlying
   the physiological characteristics of S. lalandi and its adaptability to
   the external environment. We believe these new resources will promote
   genetic research and accelerate the genetic breeding process for S.
   lalandi.

Data Availability Statement

   The datasets presented in this study can be found in online
   repositories. The names of the repository/repositories and accession
   number(s) can be found below: GenBank,
   [135]JAIQDC010000001.1-[136]JAIQDC010000031.1; NCBI BioProject,
   PRJNA754209.

Ethics Statement

   The animal study was reviewed and approved by Experimental Animal Care,
   Ethics and Safety Inspection Form Yellow Sea Fisheries Research
   Institute, CAFS. Written informed consent was obtained from the owners
   for the participation of their animals in this study.

Author Contributions

   LX, CS, and XY conceived and designed the experiments. BW, AC, and YJ
   collected, identified, and photographed the specimens. SL, KL, and XH
   analyzed the genome and transcriptome data. WH, WQ, and BF performed
   gene analysis. SL drafted the manuscript. CS, YX, and ZL provided
   advice on manuscript writing. All authors reviewed the manuscript.

Funding

   This work was supported the National Key R&D Program of China
   (2018YFD0900301, 2019YFD0900901, 2018YFD0901204); the Marine S&T Fund
   of Shandong Province for Pilot National Laboratory for Marine Science
   and Technology (Qingdao) (2018SDKJ0303-1); the Central Public-interest
   Scientific Institution Basal Research Fund, CAFS (No.2020TD19,
   2020TD47, 2021GH05); the Taishan Scholar Project Fund of Shandong of
   China; the China Agriculture Research System of MOF and MARA (CARS-47).

Conflict of Interest

   The authors declare that the research was conducted in the absence of
   any commercial or financial relationships that could be construed as a
   potential conflict of interest.

Publisher’s Note

   All claims expressed in this article are solely those of the authors
   and do not necessarily represent those of their affiliated
   organizations, or those of the publisher, the editors and the
   reviewers. Any product that may be evaluated in this article, or claim
   that may be made by its manufacturer, is not guaranteed or endorsed by
   the publisher.

Supplementary Material

   The Supplementary Material for this article can be found online at:
   [137]https://www.frontiersin.org/articles/10.3389/fgene.2021.825742/ful
   l#supplementary-material
   [138]Click here for additional data file.^ (10.1KB, xlsx)
   [139]Click here for additional data file.^ (10.5KB, XLSX)
   [140]Click here for additional data file.^ (16.7KB, docx)
   [141]Click here for additional data file.^ (14.8KB, XLSX)
   [142]Click here for additional data file.^ (90.9KB, TIF)
   [143]Click here for additional data file.^ (1.7MB, TIF)
   [144]Click here for additional data file.^ (591.8KB, TIF)
   [145]Click here for additional data file.^ (9.9KB, XLSX)
   [146]Click here for additional data file.^ (11.5KB, XLSX)
   [147]Click here for additional data file.^ (9.7KB, XLSX)
   [148]Click here for additional data file.^ (102.8KB, XLSX)
   [149]Click here for additional data file.^ (10KB, XLSX)
   [150]Click here for additional data file.^ (10.4KB, xlsx)
   [151]Click here for additional data file.^ (10KB, XLSX)

Abbreviations

   N50, median size; Gb, gigabase pairs; Mb, megabase pairs; kb, kilobase
   pairs; PacBio, Pacific Biosciences; SMRT, single molecule, real-time;
   BUSCOs, Benchmarking Universal Single-Copy Orthologs; GC,
   guanine-cytosine; GO, Gene Ontogeny; KEGG, Kyoto Encyclopedia of Genes
   and Genomes; Mb, megabase pairs; ML, maximum likelihood; MYA, million
   years ago; NCBI, national center for biotechnology information; HSP70,
   heat shock protein 70 family.

References