Abstract

   Whole-genome single-nucleotide polymorphism (SNP) markers are valuable
   genetic resources for the association and conservation studies.
   Genome-wide SNP development in many teleost species are still
   challenging because of the genome complexity and the cost of
   re-sequencing. Genotyping-By-Sequencing (GBS) provided an efficient
   reduced representative method to squeeze cost for SNP detection;
   however, most of recent GBS applications were reported on plant
   organisms. In this work, we used an EcoRI-NlaIII based GBS protocol to
   teleost large yellow croaker, an important commercial fish in China and
   East-Asia, and reported the first whole-genome SNP development for the
   species. 69,845 high quality SNP markers that evenly distributed along
   genome were detected in at least 80% of 500 individuals. Nearly 95%
   randomly selected genotypes were successfully validated by Sequenom
   MassARRAY assay. The association studies with the muscle
   eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA) content
   discovered 39 significant SNP markers, contributing as high up to ∼63%
   genetic variance that explained by all markers. Functional genes that
   involved in fat digestion and absorption pathway were identified, such
   as APOB, CRAT and OSBPL10. Notably, PPT2 Gene, previously identified in
   the association study of the plasma n-3 and n-6 polyunsaturated fatty
   acid level in human, was re-discovered in large yellow croaker. Our
   study verified that EcoRI-NlaIII based GBS could produce quality SNP
   markers in a cost-efficient manner in teleost genome. The developed SNP
   markers and the EPA and DHA associated SNP loci provided invaluable
   resources for the population structure, conservation genetics and
   genomic selection of large yellow croaker and other fish organisms.

   Keywords: Genotyping-By-Sequencing (GBS), Single nucleotide
   polymorphism (SNP), Teleost, Marker development, Large yellow croaker

Introduction

   Whole-genome single nucleotide polymorphism (SNP) is one of the most
   important genomic resources for population diversity, conservation
   genetics and functional gene identification for biological traits
   ([36]Seeb et al., 2011). To obtain the molecular markers of the shared
   genomic loci among individuals, many technologies were invented and
   developed to probe whole-genome polymorphisms. The techniques allowing
   synthesizing DNA probes in chips have led to the advent and application
   of SNP microarrays ([37]Lipshutz et al., 1999), making it possible to
   explore genome-wide SNP in a high-throughput manner. However the cost
   of array design and application obstructs the wider usage in non-model
   species, especially for endangered and economic organisms ([38]De
   Donato et al., 2013). More importantly, microarray approaches cannot
   discover novel SNP loci for species without reference sequences
   ([39]Popova et al., 2013). With the development of next-generation
   sequencing (NGS), the state-of-art sequencing platform enable
   scientists to scan small variants in genomes at an unprecedentedly
   scale with rapidly decreasing price. The multiplex library strategies
   were widely used to further reduce the cost per sample. However, the
   budget is still one of the biggest challenges for whole-genome
   resequencing in non-model samples ([40]Muir et al., 2016). Furthermore,
   the whole-genome sequencing data for hundreds of individuals also
   inevitably burdens the limited computational and bioinformatics
   capacity in labs.

   In the past few years, several robust sequencing-based genotyping
   techniques have been invented in the research community to overcome the
   bottle-neck of cost in whole-genome resequencing. Most of those
   innovations employ a strategy of partial genome representation
   sequencing ([41]Narum et al., 2013), such as restriction site
   associated DNA (RAD) ([42]Rowe, Renaut & Guggisberg, 2011), IIB
   restriction endonucleases based RAD (2bRAD) ([43]Wang et al., 2012) and
   Genotyping-By-Sequencing (GBS) ([44]De Donato et al., 2013). RAD
   applies a restriction enzyme to digest genome DNA and then random
   fragment them to generate RAD tags. Although RAD experiments was
   initially designed for microarray-based genotyping ([45]Miller et al.,
   2007), the updated RAD tag isolation and library construction procedure
   has been prevalently used to couple with high-throughput sequencing on
   the Illumina platforms, resulting many successful applications for
   genome-wide genotyping, genetic mapping, quantitative trait locus (QTL)
   and association studies ([46]Baird et al., 2008). However, RAD still
   depends on random fragmentations, reducing the consistence on SNP loci
   among samples. [47]Elshire et al. (2011) subsequently developed a more
   straightforward genotyping method as GBS with restriction enzymes of
   ApeKI in maize and barley. The protocols for GBS are simple, extremely
   specific and highly reproducible. In recent years, the easy
   transferability of GBS to other species leads to many application in
   plants ([48]Poland & Rife, 2012). One of the most attracting features
   of GBS is the using of methylation-sensitive restriction enzymes during
   libraries constructions to avoid repetitive fragments and to simplify
   the reads alignments in extremely complex genomes ([49]Elshire et al.,
   2011); therefore, GBS is an excellent whole-genome genotyping technique
   for complex non-model organism genomes with massive repetitive regions
   and abundant genetic diversities.

   Teleost, representing a large portion of fish species, has been showed
   to undergo the third round of whole-genome duplication (WGD) 370
   million years ago ([50]Braasch et al., 2016; [51]Xiao et al., 2015b).
   The extra WGD left a large portion of duplicated and repetitive
   sequences in teleost genomes ([52]Berthelot et al., 2014; [53]Jaillon
   et al., 2004), making the accurate whole-genome SNP marker development
   was still challenging in many teleost species ([54]Wang et al., 2008).
   We speculated that GBS technique provided an efficient way and was
   suitable for genotyping in teleost complex genome. However, the
   whole-genome SNP development and association studies based on GBS is
   rarely reported on teleost fish species. Large yellow croaker
   (Larimichthys crocea), belonging to the Sciaenidae family of teleost,
   is an important marine fish in China and East Asia ([55]Xiao et al.,
   2015a). Due to over-fishing and habitat degradation in last decades,
   the wild stock of the species has rapidly collapsed ([56]Liu, Mitcheson
   & Sadovy, 2008). The environmental changes and over-dense aquaculture
   pose more challenges on population conservation and sustainable
   development of the aquaculture for large yellow croaker. Whole-genome
   molecular markers and genome-wide association studies (GWAS) for
   important traits are prerequisites for the population conservation and
   genomic selection of the species ([57]Steiner et al., 2013). However,
   the association studies are rarely reported for large yellow croaker,
   largely because of the lacking of abundant stable genomic SNP markers.

   GBS technique provides the potential cost-efficient way for
   whole-genome SNP marker development in complex teleost genome. In the
   present investigation, we used large yellow croaker to verify the
   applicability of GBS on teleost. Two restriction enzymes of EcoRI and
   NlaIII based GBS protocol was developed and optimized. Massive
   whole-genome SNP markers were developed from the sequencing reads by
   bioinformatic pipelines, which were subsequently validated by
   Sequenom MassARRAY assay. The detected SNP markers in this work were
   then applied to the whole-genome association study of the muscle
   Eicosapentaenoic Acid (EPA) and Docosahexaenoic Acid (DHA) content in
   large yellow croaker. Our study confirmed the suitability of GBS on
   whole-genome SNP marker development in teleost genome. The developed
   whole-genome SNP markers and functional genes involved in muscle EPA
   and DHA contents offered valuable genetic resources for conservation
   genetics and genomic selection of large yellow croaker.

Materials and Methods

Ethics statement

   The sample collection and experiments in the study was approved by the
   Animal Care and Use Committee of Fisheries College of Jimei University
   (Animal Ethics no. 1067).

Sample preparation and DNA extraction

   The mixed reference population of 500 individuals was bred by a random
   fertilization of 30 males and 30 females at the large yellow croaker
   breeding base of Jimei University in Ningde, Fujian, China. All fish
   individuals were 1.5 year old with the total length and weight of
   24.5–25.9 cm and 217.8–234.1 g (95% confidence interval), respectively.
   To extract genomic DNA respectively from 500 individuals, the dorsal
   fins (20–30 mg) of the fish individuals were collected, frozen in
   liquid nitrogen for the following DNA extraction. Total genomic DNA was
   prepared in 1.5 ml microcentrifuge tubes containing 550 µl TE buffer
   (100 mM NaCl, 10 mM Tris, pH 8, 25 mM EDTA, 0.5% SDS and proteinase K,
   0.1 mg/ml). The samples were incubated at 55 °C overnight and
   subsequently extracted twice using phenol and then phenol/chloroform
   (1:1) method. DNA was precipitated by adding two and a half volumes of
   ethanol, collected by brief centrifugation, washed twice with 70%
   ethanol, air dried, re-dissolved in TE buffer (10 mM Tris–HCl, 1 mM
   EDTA, pH 7.5). DNA concentration and quality were estimated with an
   ND-1000 spectrophotometer (NanoDrop, Wilmington, DE, USA) and by
   electrophoresis in 0.8% agarose gels with a lambda DNA standard.

In silico enzyme assessment for GBS library construction

   To assess how endonucleases influence the DNA fragment length
   distribution, four enzyme combinations were designed for in silico
   digestion of large yellow croaker genome: ApeKI-PstI, EcoRI-BstNI,
   EcoRI-NlaIII and PstI-NlaIII (NEB, Ipswich, MA, USA). The activation
   and heat-inactivation temperature, location in chromosome and length
   distribution of all fragments were analyzed to evaluation the best
   performance of each enzyme combination.

GBS library construction and sequencing

   The GBS libraries were constructed based on two DNA endonucleases:
   EcoRI (NEB, Ipswich, MA, USA) and NlaIII (NEB, Ipswich, MA, USA). A
   pilot GBS experiment was performed before the library construction to
   optimize the temperature and time parameters for yield, size
   distribution. Based on the pilot experiment, the GBS libraries of large
   yellow croaker based on EcoRI and NlaIII were constructed following the
   similar method in previous report ([58]Beissinger et al., 2013).
   Briefly, genomic DNA (20 ng/µl) was incubated at 37°C with EcoRI and
   NlaIII, 10XCutSmart™ Buffer. The restriction reactions were
   heat-inactivated at 65 °C by 20 min and were kept in 8 °C for the
   following experiments. Sequencing adaptor and barcode mix, T4 DNA
   Ligase, 10 mM ATP and 10XCutSmart™ Buffer were incubated at 16 °C for
   2 h for ligation reactions. The reactions were then heat-inactivated at
   65 °C by 20 min and the reaction systems were kept in 8 °C. Then,
   polymerase chain reactions (PCR) experiments were performed in the
   reaction solutions (20 µL) containing the diluted restriction/ligation
   samples (4 pM, 2 µL), dNTP (each at 10 mM, 5 µL), Taq DNA polymerase
   (NEB, Ipswich, MA, USA) (5 units/µL, 0.25 µL), Illumina Primers (each
   at 10 µM, 1 µL) and Indexing Primers (10 µM, 1 µL). The PCR procedure
   was: 95 °C 2 min; 15 cycle of 95 °C 30 s 60 °C 30 s, 72 °C 30 s; 72 °C
   5 min and kept in 4 °C. The PCR products were run on a 8%
   polyacrylamide gel electrophoresis. Fragments of 200–300 bp were
   isolated using QIAGEN QIAquick^® Gel Extraction Kit and diluted for
   pair-end sequencing on an Illumina HiSeq 2500 sequencing platform
   (Illumina, Inc, San Diego, CA, USA).

Sequencing read quality control and genotyping

   The raw sequencing reads generated by Illumina HiSeq 2500 from the GBS
   libraries were treated and cleaned for SNP detection. First, the
   adaptors were removed and the resulted reads were split by
   sample-specific barcode sequences. Only reads begins with the digest
   site sequences of EcoRI and NlaIII were retained for the following
   quality control. Second, the overall base and read quality were
   accessed by FastQC. To avoid the negative influence of ambiguous bases
   for SNP detection, reads with more than 5% of N were removed. Then, the
   resulted reads were cleaned by the following steps: (1) discarding the
   reads that the quality lower than 20; (2) deleting 5 bp windows in
   reads end that the average quality smaller than 20; (3) removing read
   pairs if one end was shorter than 50 bp.

   The cleaned reads were mapped to large yellow croaker genome by BWA
   0.7.6a ([59]Li & Durbin, 2009). The mapping was preceded by a short
   reads alignment with BWA-MEM algorithm. The alignment were then sorted
   by coordinates and duplicate marked by SortSam and MarkDuplicates
   programs in Picard tools 1.107 (picard.sourceforge.net), respectively.
   To reduce the false positives of SNP detection in this study, three
   processes were carried out: (1) short read mapping were re-aligned by
   local bases matches; (2) base Quality Score Recalibration (BQSR) was
   employed to adjust the accuracy of the base and mapping quality scores;
   (3) only reads pairs that both aligned on genome with a mapping score
   higher than 30 were used for SNP calling. Then, the SNP markers were
   detected by GATK UnifiedGenotyper utility.

SNP validation by Sequenom MassARRAY assay

   Genomic DNA was extracted from dorsal fin ray tissue as the method
   described before. PCR amplification was performed in the reaction
   system (5 µl total volume) containing 20 ng of genomic DNA, 0.5U
   HotstarTaq (Qiagen), 0.5 µl 10× PCR buffer, 0.1 µl dNTPs for each
   nucleotide and 0.5 pmol of each primer. All PCR experiments were
   carried out in a PTC-100 PCR instrument (Eppendorf) with the following
   program: 4 min denaturation at 94 °C, 35 cycles of 20 s at 94 °C, 30 s
   at 56 °C and 1 min at 72 °C and a final extension at 72 °C for 3 min.
   After the PCR products were cleaned using 2 µl SAP (SEQUENOM), the
   single base extension used 2 µl EXTEND Mix (SEQUENOM) contained 0.94 µl
   Extend primer Mix, 0.041 µl iPLEX enzyme and 0.2 µl iPLEX termination
   mix and performed with the following steps: initial denaturation at
   94 °C for 30 s, followed by 40 cycles of 3-step amplification profile
   of 5 s at 94 °C, additional 5 cycles of 5 s at 52 °C and 5 s at 80 °C
   and a final extension at 72 °C for 3 min.The PCR product was cleaned by
   resin purification and then analyzed using MassARRAY Analyzer Compac
   (SEQUENOM) and software TYPER (SEQUENOM).

   To evaluate the accuracy of SNP detection in this study, the genotypes
   from GATK SNP calling were compared with those from MassARRAY assay. If
   the genotypes of one SNP locus from GATK calling were identical with
   that in MassARRAY, then the locus was called a correct genotype. As a
   result, 1,421 of 1,500 SNP loci were correctly genotyped by GATK and
   the success rate of SNP calling was ∼94.7%. The specificity and
   sensitivity of SNP calling in the study were also evaluated. The
   reference homozygous genotypes (AA) both from MassARRAY and GATK were
   called true negatives, and the heterozygous genotypes or allelic
   homozygous (AB and BB) both from MassARRAY and GATK were called true
   positives. Specificity was then calculated as the number of true
   positives divided by the number of true positives plus the number of
   false positives, and the sensitivity was estimated as the number of
   true positives divided by the number of true positives plus the number
   of false negatives as the following formula:
   [MATH: <mstyle
   displaystyle="true"><mtext>Specificity</mtext><mo>=</mo><mfrac><mrow><m
   text>true negative</mtext></mrow><mrow><mtext>true
   negative</mtext><mspace width="1em"></mspace><mo>+</mo><mspace
   width="1em"></mspace><mtext>false
   positive</mtext></mrow></mfrac></mstyle> :MATH]
   [MATH: <mstyle
   displaystyle="true"><mtext>Sensitivity</mtext><mo>=</mo><mfrac><mrow><m
   text>true positive</mtext></mrow><mrow><mtext>true
   positive</mtext><mspace width="1em"></mspace><mo>+</mo><mspace
   width="1em"></mspace><mtext>false
   negative</mtext></mrow></mfrac><mo>.</mo></mstyle> :MATH]

Association analysis with the muscle EPA and DHA content

   From the 500 large yellow croaker population, 200 individuals were
   randomly selected for the muscle EPA and DHA content measurement for
   the following statistics and association analysis. The fat acid
   composition analysis followed the similar methods in previous reports
   ([60]Murillo, Rao & Durant, 2014). Briefly, the total lipid was
   extracted from the fresh muscle tissue using the chloroform-methanol
   method ([61]Folch, Lees & Sloane-Stanley, 1957). After saponification
   with 1 ml of 50% KOH in 15 ml ethanol, the lipid was then esterified in
   80 °C for 20 min using 6.7% boron trifluoride (BF3) in methanol (Morita
   chemical industries Co., Ltd., Osaka, Japan). After making up in hexane
   (20 mg/ml), fatty acid methyl esters (FAME) preparations were analyzed
   by gas chromatography (GC). The temperature increase of 170–260 °C at
   2 °C/min was set and helium was used as the carrier gas. Since the
   muscle contents of EPA and DHA were highly correlated, we combined
   those two components together in fish muscle.

   Prior to the association study, pairwise clustering based on the
   alleles shared identical by state (IBS) between any two individuals was
   performed to assess the genetic relatedness. The population
   stratification was analyzed by multi dimensional scaling (MDS)
   clustering methods available in the PLINK software ([62]Purcell et al.,
   2007).

   With the developed SNP markers, the association analysis was performed
   between genotypes and measured muscle EPA and DHA content using Plink
   1.07 A simple linear regression of phenotype on genotype was performed
   in the analysis. Markers with p-values ≤ 1e–4 were considered
   significantly associated with muscle EPA and DHA contents. To identify
   the biological functions of nearby genes and whether the orthologs of
   these significantly associated loci were also associated with the EPA
   and DHA content in other species, we identified the protein-coding
   genes around 50 kb of the significant SNP markers. We aligned the genes
   against the NCBI nr database by Blastx ([63]Altschul et al., 1997). GO
   term and KEGG pathway enrichment analysis of the associated genes were
   performed with Gene set enrichment analysis (GSEA) ([64]Shi & Walker,
   2007) by two-tailed Fish’s exact test with Benjamini & Hochberg false
   discovery rate (FDR) ([65]Benjamini & Hochberg, 1995) against the
   background of the all protein-coding gene in large yellow croaker
   genome. The additive genetic variances were estimated by using
   R-package EMMREML, Version 3.1.
   ([66]http://mirror.bjtu.edu.cn/cran/web/packages/EMMREML/index.html).

Results

Enzyme assessment and GBS construction for large yellow croaker

   According to the principles of the enzyme combination design for GBS
   library construction, four enzyme combinations were designed for the
   GBS analysis of large yellow croaker genome: ApeKI-PstI, EcoRI-BstNI,
   EcoRI-NlaIII and PstI-NlaIII (NEB, Ipswich, MA, USA). To assess the
   fragment size distribution and the number of potential SNP marker
   developed, the public large yellow croaker draft genome sequences
   ([67]Ao et al., 2015) were in silico digested by the four two-enzyme
   combinations to mimic the genomic fragmentation. As shown in the
   [68]Fig. 1, the predicted fragment numbers decreased with the fragment
   size for all enzyme combinations, but the ApeKI-PstI and EcoRI-BstNI
   lead to a large portion of fragments longer than 1 kb. According to the
   size distribution in [69]Fig. 1 and to make the fragment size more
   compatible to NGS sequencing, the genomic fragment with a size of
   100–300 bp were preferred; therefore, EcoRI-NlaIII and PstI-NlaIII were
   the rational combinations for the following library construction.
   According to our in silico experiments, genomic fragment in the range
   of 200–300 bp were used to construct GBS libraries (see ‘Method’ for
   details). By the assessment of the combination of EcoRI-NlaIII, roughly
   1.5 million fragments would be collected in libraries.

Figure 1. Fragment length distribution by restraint enzyme combination.

   [70]Figure 1
   [71]Open in a new tab

   Note that all fragments longer than 1 kb were accumulated in the last
   bar.

Library sequencing and reads mapping

   GBS libraries were constructed by the two enzyme based digestion (see
   ‘Method’ for the details). The NGS sequencing of GBS libraries for 500
   individuals generate roughly 314 Gb raw sequencing reads. To evaluate
   the raw data distribution among samples, we found that the majority of
   individuals (∼95%) had the raw sequencing reads ranged from 600 to 650
   Mb, indicating the excellent sequencing uniformity among samples from
   library construction and sequencing. The raw reads were cleaned by
   HTSeq to trim low quality ends (average quality < 20) and eliminate
   short reads (length < 50 bp). The cleaned reads were mapped to large
   yellow croaker reference genome sequences ([72]Ao et al., 2015) by BWA
   ([73]Li & Durbin, 2009). To assess the quality of GBS library, the
   mapped reads distribution of Sample 88 along the linkage groups were
   illuminated as an example ([74]Fig. S1). We found that reads were
   evenly covered all linkage groups of large yellow croaker, indicating
   an ideal representativeness of the libraries at the whole genome level.
   The covered loci depth distribution ([75]Fig. S2) showed that the
   majority of depth ranged from 5 to 20 and extreme reads enrichment on
   genome local regions were successfully avoided in sequencing libraries.

SNP discovery among samples in large yellow croaker genome

   To develop molecular markers based on the GBS library sequencing, SNP
   variants markers were detected from the reads alignments by GATK
   ([76]McKenna et al., 2010) pipelines (see ‘Method and Material’ for
   detailed information). To improve the quality of the detected SNP, we
   employed the extra reads local re-alignment and Base Quality Score
   Recalibration (BQSR) steps in SNP calling pipelines. Previous
   literatures on model organisms showed that those extra processes on
   reads alignment and SNP quality could significant reduce the false
   positives SNPs ([77]De Pristo et al., 2011; [78]McKenna et al., 2010;
   [79]Van der Auwera et al., 2013), therefore our refined bioinformatics
   pipeline coupled with the library construction provided a solid
   foundation for SNP detection in this study. As a result, 489,246 SNP
   markers were discovered in at least 200 large yellow croakers with a
   loci depth threshold of 3. It is not surprised that the majority of SNP
   markers were not shared by all samples because of the inherent DNA
   polymorphisms on enzyme digestion site in genome. The number of shared
   SNP markers among samples was crucial for the evaluation of the GBS
   sequencing of large yellow croaker genome, especially for the studies
   of QTL and GWAS analysis in populations. We further used depth- and
   population-based method to investigate the influence of loci depth and
   population size on the number of shared SNP marker. As we expected,
   both the population size and loci depth dramatically influenced the
   number of shared SNP markers ([80]Fig. 2). However, hundreds of
   thousands of the shared SNP markers were identified with a depth
   threshold of 5 in the study. According to previous literatures on SNP
   development in non-model organisms, the depth filtering of 5 provided
   high quality SNP markers for the genetic studies ([81]Hiremath et al.,
   2012; [82]Nguyen, Hayes & Ingram, 2014); therefore, our SNP calling
   based on GBS library developed sufficient SNP markers for the
   biological trait mapping and conservation genetics. We indeed found the
   sharp decreases on the number of shared SNPs for the population size
   from 450 to 500, which could be attributed to the samples with
   extremely low sequencing amount.

Figure 2. SNP number against sequencing depth and completeness.

   [83]Figure 2
   [84]Open in a new tab

   Note that only SNP loci witha quality score larger than 100 were used
   in the analysis.

   To control SNP marker quality while maximizing the number of shared
   samples and to facilitate the following GWAS analysis, markers with the
   loci depth higher than 5 and the shared in at least 400 individuals
   (90% of all sample) were used for the following analysis, resulting
   69,845 SNP markers in large yellow croaker genome. To answer the
   question if our sequencing data was sufficient for the whole-genome SNP
   development, the numbers of the detected genomic SNP markers were
   plotted against the sequencing data for each sample. As shown in
   [85]Fig. 3, the number of the discovered SNP marker increased with
   sequencing reads and remained to be ∼70,000 when the sequencing amount
   reached 600 Mb, implying that 600 Mb might be an optimal data amount
   for the trade-off the cost and SNP number in our large yellow croaker
   GBS libraries. The distribution of those SNP markers in 24 linkage
   groups ([86]Fig. 4) showed that those SNP markers were ideally evenly
   distributed in the genome, suggesting an excellent representation of
   whole genome markers in large yellow croaker. The location and
   functions of SNPs were investigated by comparing the locus coordinates
   with those of gene annotations. We found that ∼53 % of these SNPs were
   from genic regions, including exons (3,000), introns (27,114), and
   untranslated regions (UTRs, 9,166) ([87]Fig. 3). The detailed SNP
   categories in UTR revealed that 4,311 and 4,855 SNPs were from 5UTR and
   3UTR respectively. The biological functions of those SNP markers were
   analyzed according to their relative positions of the protein-coding
   genes. As in [88]Fig. 5, 866 SNPs markers in coding regions caused
   synonymous mutations. Of the remaining markers, 3,022 SNPs could lead
   to a change of amino acid and introduction of frame shift and new or
   lost start/stop codons. Those SNP markers might significantly alter the
   biological functions of the hosting genes and thus influence the
   biological traits that controlled by those genes.

Figure 3. SNP number against sequencing amount.

   [89]Figure 3
   [90]Open in a new tab

   The distribution of sequencingamount (top) and SNP marker number
   (right) were plot by sides. The line in the scatter is the smoothed
   curve cross all samples, and the grey area represent the 95% of the
   confidence region.

Figure 4. SNP distribution along chromosome.

   [91]Figure 4
   [92]Open in a new tab

   The lines along the chromosomes represent SNP loci. The SNP location in
   exon, intron, UTR and intergenic regions are showed by red, blue,
   purple and green, respectively.

Figure 5. Biological impact annotations of high quality SNP markers that
shared by at least 80% of the population with 500 large yellow croakers.

   [93]Figure 5
   [94]Open in a new tab

Experimental validation of detected SNP loci

   To assess the reliability of the SNP makers developed from the reduced
   representation libraries, 50 loci from 30 individuals were randomly
   selected to validate the marker polymorphism by the Sequenom MassARRAY
   assay. As shown in [95]Table 1, MassARRAY assay verified the most of
   detected SNP markers in those samples. Among 1,500 markers, 1,421 were
   validated by MassARRAY, confirming our library construction, sequencing
   and SNP marker calling pipelines. The primers for SNP validation and
   the detailed genotypes were listed in [96]Tables S1 and [97]S2,
   respectively. As shown in the [98]Table 1, the specificity and
   sensitivity for the SNP genotype detection in the present study were
   estimated as 94.2% and 98.3%, respectively. Notably, we found that the
   majority of discordant genotypes were heterozygous, which was
   consistent with the reports for other organisms ([99]Sonah et al.,
   2013). We attributed the error-prone genotypes in heterozygous markers
   to the fact that those markers need more supporting reads than their
   homozygous counterparts. However, the Sequenom MassARRAY assay still
   successfully validated ∼95% of the detected SNP marker developed by the
   GBS library sequencing, providing us solid SNP genotypes of the
   following trait association and other genetics studies for large yellow
   croaker.

Table 1. SNP validation by Sequenom MassARRAY assay.

   NNs indicate the failed genotypes during the SNP filtering.
     Genotypes      Sequenom MassARRAY assay
                  AA  AB  BB         NN
   SNP calling AA 901 0   2   0
               AB 54  404 7   0
               BB 0   2   116 0
               NN 11  1   2   0
   [100]Open in a new tab

The association study with the muscle EPA and DHA content

   To apply the genome-wide markers to probe potential marker and genes
   contributing to muscle EPA and DHA contents, 200 large yellow croakers
   reared with the identical feed in the same netcage were used to
   quantify EPA and DHA level. Muscle EPA and DHA contents in 176
   individuals were successfully extracted and measured. The contents
   exhibited a typical normal-like distribution (p-value of 0.94 with
   Kolmogorov–Smirnov test) with an average of 21.5 mg/g and a standard
   deviation of 4.1 mg/g ([101]Fig. S3). The difference of the highest and
   the lowest EPA and DHA contents was ∼13.8 mg/g.

   Before the association study, population stratification with MDS
   clustering showed that all samples were grouped into several clusters
   ([102]Fig. S4). 200 individuals used in the association study were
   distributed among population clusters, providing diversified genetic
   background for the following studies. From the MDS plot ([103]Fig. S4),
   the first and second dimension value (arbitrary value) were all smaller
   than 0.05, indicating homogeneous genotypes among samples. The
   association study of SNP marker with the muscle EPA and DHA content was
   performed with the linear model with a covariance to sex in Plink
   ([104]Purcell et al., 2007). 69,845 SNP loci developed above with depth
   threshold of 5 were used to perform the association study ([105]Fig.
   5). As shown in [106]Fig. 6, 39 markers from 11 linkage groups were
   exhibited significant association with the EPA and DHA content (p-value
   <  1e–4). Notably, many associated markers were significant by clusters
   in linkage group 4, 5 and 11, suggesting the credibility of the
   association studies. The results might also imply that many genes
   contributed to the muscle EPA and DHA levels in large yellow croaker.
   With the variance estimation by Restricted Maximum Likelihood (REML)
   method ([107]Smith & Graser, 1986), we found that those 39 significant
   markers could interpret as high up to ∼63.0% of genetic variance
   explained by all 69,845 markers.

Figure 6. GWAS analysis on the muscle EPA and DHA content and the functional
analysis therelated protein-coding genes.

   [108]Figure 6
   [109]Open in a new tab

   (A) The association results were illuminated in the Manhattan plot. The
   red line is the p-value threshold for significant markers; (B) KEGG
   pathway enrichment of functional genes; (C) GO term enrichment of
   related biological functions for the associated genes.

   To identify gene contributing to the muscle EPA and DHA content in
   large yellow croaker, we investigated the biological functions of
   protein-coding genes within 50 kb of all significant SNP markers
   (p-value <  1e–4). As a result, 122 genes were identified from the
   above association regions. The biological KEGG pathway and GO term
   annotations of the associated genes were enriched under the background
   of all protein-coding genes. The metabolic pathway of fat digestion and
   absorption was significant (FDR <  0.023) in the KEGG enrichment
   ([110]Fig. 6B and [111]Table S3). Meanwhile, GO terms of unsaturated
   fatty acid biosynthetic process, fatty acid derivative biosynthetic
   process and lipid transporter activity were also highlighted (FDR
   < 0.05) for the associated functional genes ([112]Fig. 6C). The
   detailed gene function GO annotations were summarized in [113]Table S4.
   We found that the many identified genes played important roles in lipid
   transport, metabolism and transcription regulation, such as
   apolipoprotein B (APOB), Carnitine O-acetyltrasferase (CRAT) and
   oxysterol binding protein 10 (OSBPL10). APOB is a crucial lipid
   transport protein in organism. Previous nutriology studies confirmed
   the correlation of EPA and DHA contents with APOB genotypes and gene
   expression ([114]Anil, 2007). Given their close relationship, we
   speculated that the polymorphisms on APOB gene might contribute to the
   EPA and DAH accumulation in large yellow croaker muscle. CRAT and
   OSBPL10 may also involved in the muscle EPA and DHA content since
   carnitine and oxysterol were important components and regulators in EPA
   and DHA synthesis pathways according to previous reports ([115]Qiu,
   2003; [116]Rise, Marangoni & Galli, 2002). Notably, we observed
   palmitoyl-protein thioesteraes 2 (PPT2) (around a maker with a p-value
   of 6.7e−06) as a potential functional gene contributing to muscle EPA
   and DHA contents. PPT2 gene was also identified by genome-wide
   association study on n-3 and n-6 polyunsaturated fatty acid levels in
   Chinese and European-ancestry populations ([117]Dorajoo et al., 2015;
   [118]Hu et al., 2016).

Discussions

   The advent and development of NGS have unprecedentedly prompted the
   application of the whole-genome marker development ([119]Seeb et al.,
   2011). Recently, SNP developments on genomic level were performed in
   many species, including livestock and fish in agriculture ([120]Sun et
   al., 2014). However, the cost for whole-genome re-sequencing is still
   one of the largest challenges in genomic marker developments. Based on
   NGS, GBS generally used multiple endonucleases to obtain the desired
   genomic length and the number of fragments to squeeze the sequencing
   cost ([121]De Donato et al., 2013; [122]Elshire et al., 2011;
   [123]Sonah et al., 2013), thus improving the specificity of marker
   detection along individuals; however, most of the GBS application were
   reported for plant genomes. Teleost, representing a large portion of
   fish species, has been showed to undergo the additional third round of
   whole-genome duplication (WGD) 370 million years ago. The extra genome
   duplication led to a large portion of duplicated and repetitive
   sequences in teleost genomes ([124]Sémon & Wolfe, 2007). GBS techniques
   provided an efficient way to probe polymorphism markers from complex
   genomes; however, the whole-genome SNP development and association
   studies based on GBS is rarely reported on teleost fish species. In
   this work, we used teleost large yellow croaker to verify the
   applicability of GBS on genomic marker development on teleost species.
   So far as we know, this is the first GBS implementation in large yellow
   croaker genome. The developed SNP markers provided useful resources for
   the following genetic studies, including population structure,
   conservation and functional gene mapping of important traits of the
   species. The enzyme combination and GBS protocols used in this study
   could also be valuable reference for other teleost species.

   Our in silico experiments mimicked the two enzyme digestion on large
   yellow croaker genome. After the detail investigation of the enzymes,
   we chose the combination of EcoRI and NlaIII for GBS protocol for two
   reasons. Firstly, EcoRI and NlaIII possessed the identical
   heat-inactivation temperature, which facilitated the pilot studies to
   optimize the experimental conditions for library construction; secondly
   and more importantly, EcoRI was sensitive when restriction site
   overlaps methylation sequence of CpG islands, therefore the using of
   the enzyme would partially avoid the digestion in repetitive regions.
   Many previous GBS libraries were constructed with the fragment length
   100–300 bp or even wider ([125]Elshire et al., 2011; [126]Sonah et al.,
   2013); however, we predicted ∼3 million fragment would be generated in
   that range. The large number of fragment implied a large amount of
   sequencing reads to cover those genomic regions, which would increase
   the unit-cost for the sequencing. To reduce the genomic fragments
   needed to be sequenced for libraries in this study, we attempted to
   narrow the length range to 200–300 bp, which was predicted to generate
   roughly 1.5 million genomic fragment for sequencing.

   Taking the SNP frequency of 1 per 1,000 bp ([127]Pushkarev, Neff &
   Quake, 2009), the library sequencing might result into roughly 300
   thousand SNP markers along large yellow croaker genome. Our estimation
   was based on the assumption that all individuals have no mutation on
   endonuclease digesting site and the read depth were high enough to
   cover SNP loci. However, because of the divergent genomic background
   among populations, it is very hard to detect all SNP markers that
   shared by all individuals. In this work, 489,246 raw SNP markers
   supported by more than three reads were detected with an average
   sequencing amount of 600 Mb in at least 200 individual from the 500
   large yellow croaker population. To facilitate the following marker
   association study and breeding practise, previous studies proposed
   several methods to filter the high quality SNP that shared in more
   individuals, such as depth-based ([128]Li et al., 2009), quality
   score-based ([129]Brockman et al., 2008) and population-based
   ([130]Bansal et al., 2010) manner. In this study, we employed a
   composite strategy for SNP filtering by simultaneously considering loci
   depth, marker quality and shared population size. As a result, 69,845
   SNP markers were left with a depth higher than 5, quality score higher
   than 100 and shared with at least 80% individuals (400 large yellow
   croakers). More than half (∼53%) of those detected quality SNP markers
   resided in genic regions, enabling us to probe the possible association
   of trait with the nearby functional genes. We noticed that the
   percentage of markers in genic regions was higher than that of previous
   reports in soybean (39.5%) ([131]Sonah et al., 2013) but lower than
   that of sweat cherry (65.6%) ([132]Guajardo et al., 2015). Those SNP
   markers generated from the reduced representation library, especially
   those in genic regions, provided us an easy and efficient manner to
   detect genomic small variants and to identify genomic regions related
   to important traits of large yellow croaker at genomic scale. The
   detected SNP markers were then validated by Sequenom MassARRAY assay
   for the randomly selected 50 loci in 30 individuals. Although the
   success rate (94.6%) was slight lower than that reported in the similar
   study in soybean (98%) ([133]Sonah et al., 2013), the library
   preparation protocol and bioinformatics pipeline provided us high
   quality genotypes on those SNP loci in the population for the following
   association studies.

   The successful applications of GWAS have greatly prompted the
   understanding to the genetic bases of important economic traits and
   would eventually benefit the artificial breeding and population
   conservation of non-model species ([134]Correa et al., 2015; [135]Narum
   et al., 2013) EPA and DHA are both omega-3 poly-unsaturated fatty acids
   that important in human physiology ([136]Swanson, Block & Mousa, 2012).
   Previous medical experiments demonstrated their positive effects on
   depressive symptoms in clinical trials ([137]Hoffmire et al., 2012) and
   the essential functions in brain development ([138]Brenna & Carlson,
   2014). Marine fish is a main source for human EPA and DHA supplement
   and nutritional properties of fish meat are highly dependent on
   polyunsaturated fatty acid levels; therefore the EPA and DHA content in
   muscle is one of the important indicants for the meat quality of fish.
   The genetic bases controlling EPA and DHA accumulation in fish species
   are highly interconnected and not fully revealed. Identifying key SNP
   loci and functional genes will increase our knowledge of molecular
   mechanism of polyunsaturated fatty acid synthesis and metabolism in
   marine fish. To the best of our knowledge, most of the previous
   researches were focus on the genetic variants on poly unsaturated fatty
   acid metabolism after fish oil supplements in human or gene expression
   and EPA and DHA level changes with different feed in fish ([139]Gregory
   et al., 2016; [140]Li et al., 2014; [141]Li et al., 2013;
   [142]Trushenski et al., 2012). The association studies aiming to
   identity potential functional genes contributing to EPA and DHA
   accumulation in fish meat is rarely reported.

   Among 176 individuals that were used to measure the muscle EPA and DHA
   level, the average muscle EPA and DHA content in the top 20 large
   yellow croakers (28.4 mg/g) was almost two-fold of that in the lowest
   20 ones (14.6 mg/g). Given that those fish were reared in the same cage
   and fed with the identical feed, there was a great potential to raise
   the muscle EPA and DHA content in large yellow croaker via genetic
   improvements. Using the developed quality SNP markers by GBS protocol,
   39 SNP markers from 11 linkage groups were observed to be significantly
   associated with muscle EPA and DHA levels. From the coordinates of gene
   and SNP loci, 122 protein-coding genes were identified around those
   significant markers. The functional analysis by homological searching
   found that many genes were involved in fat metabolism and transport,
   such as APOB, CRAT and OSBPL10. Unsaturated fatty acid biosynthetic
   process, fatty acid derivative biosynthetic process and lipid
   transporter activity and fat digestion and absorption pathway were
   significantly enriched in GO terms and KEGG pathways for the identified
   functional genes. Meanwhile, we observed large numbers of genes
   functions in cellular metabolism, gene expression and translation
   regulation, which may also play a role in modulating muscle EPA and DHA
   contents ([143]Tables S2 and [144]S3). Interestingly, we identified the
   potential functional gene of PPT2 gene in large yellow croaker that was
   previously discovered during the whole-genome association of plasma n-3
   and n-6 polyunsaturated fatty acid level in Asian and European
   populations ([145]Hu et al., 2016). The PPT2 gene in the linkage group
   5 of large yellow croaker might play a similar function in human and
   also contribute to the muscle EPA and DHA level. This result suggested
   that teleost fish and human may shared similar metabolic pathway for
   the polyunsaturated fatty acid synthesis and accumulation; however
   biological functions of PPT2 gene for the muscle EPA and DHA content in
   large yellow croaker and other vertebrates need further gene functional
   analysis.

Conclusions

   Teleost were widely believed to undergo the third round of WGD during
   the natural evolution; therefore, genomes of many teleost species were
   characterized by the complexity of high heterozygosity and repeat
   contents. In this work, EcoRI-NlaIII based GBS protocol was used to
   develop the whole-genome SNP markers in teleost large yellow croaker.
   The study verified the applicability of GBS on teleost species and
   provided useful references for GBS applications in other fish species.