Abstract
Whole-genome single-nucleotide polymorphism (SNP) markers are valuable
genetic resources for the association and conservation studies.
Genome-wide SNP development in many teleost species are still
challenging because of the genome complexity and the cost of
re-sequencing. Genotyping-By-Sequencing (GBS) provided an efficient
reduced representative method to squeeze cost for SNP detection;
however, most of recent GBS applications were reported on plant
organisms. In this work, we used an EcoRI-NlaIII based GBS protocol to
teleost large yellow croaker, an important commercial fish in China and
East-Asia, and reported the first whole-genome SNP development for the
species. 69,845 high quality SNP markers that evenly distributed along
genome were detected in at least 80% of 500 individuals. Nearly 95%
randomly selected genotypes were successfully validated by Sequenom
MassARRAY assay. The association studies with the muscle
eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA) content
discovered 39 significant SNP markers, contributing as high up to ∼63%
genetic variance that explained by all markers. Functional genes that
involved in fat digestion and absorption pathway were identified, such
as APOB, CRAT and OSBPL10. Notably, PPT2 Gene, previously identified in
the association study of the plasma n-3 and n-6 polyunsaturated fatty
acid level in human, was re-discovered in large yellow croaker. Our
study verified that EcoRI-NlaIII based GBS could produce quality SNP
markers in a cost-efficient manner in teleost genome. The developed SNP
markers and the EPA and DHA associated SNP loci provided invaluable
resources for the population structure, conservation genetics and
genomic selection of large yellow croaker and other fish organisms.
Keywords: Genotyping-By-Sequencing (GBS), Single nucleotide
polymorphism (SNP), Teleost, Marker development, Large yellow croaker
Introduction
Whole-genome single nucleotide polymorphism (SNP) is one of the most
important genomic resources for population diversity, conservation
genetics and functional gene identification for biological traits
([36]Seeb et al., 2011). To obtain the molecular markers of the shared
genomic loci among individuals, many technologies were invented and
developed to probe whole-genome polymorphisms. The techniques allowing
synthesizing DNA probes in chips have led to the advent and application
of SNP microarrays ([37]Lipshutz et al., 1999), making it possible to
explore genome-wide SNP in a high-throughput manner. However the cost
of array design and application obstructs the wider usage in non-model
species, especially for endangered and economic organisms ([38]De
Donato et al., 2013). More importantly, microarray approaches cannot
discover novel SNP loci for species without reference sequences
([39]Popova et al., 2013). With the development of next-generation
sequencing (NGS), the state-of-art sequencing platform enable
scientists to scan small variants in genomes at an unprecedentedly
scale with rapidly decreasing price. The multiplex library strategies
were widely used to further reduce the cost per sample. However, the
budget is still one of the biggest challenges for whole-genome
resequencing in non-model samples ([40]Muir et al., 2016). Furthermore,
the whole-genome sequencing data for hundreds of individuals also
inevitably burdens the limited computational and bioinformatics
capacity in labs.
In the past few years, several robust sequencing-based genotyping
techniques have been invented in the research community to overcome the
bottle-neck of cost in whole-genome resequencing. Most of those
innovations employ a strategy of partial genome representation
sequencing ([41]Narum et al., 2013), such as restriction site
associated DNA (RAD) ([42]Rowe, Renaut & Guggisberg, 2011), IIB
restriction endonucleases based RAD (2bRAD) ([43]Wang et al., 2012) and
Genotyping-By-Sequencing (GBS) ([44]De Donato et al., 2013). RAD
applies a restriction enzyme to digest genome DNA and then random
fragment them to generate RAD tags. Although RAD experiments was
initially designed for microarray-based genotyping ([45]Miller et al.,
2007), the updated RAD tag isolation and library construction procedure
has been prevalently used to couple with high-throughput sequencing on
the Illumina platforms, resulting many successful applications for
genome-wide genotyping, genetic mapping, quantitative trait locus (QTL)
and association studies ([46]Baird et al., 2008). However, RAD still
depends on random fragmentations, reducing the consistence on SNP loci
among samples. [47]Elshire et al. (2011) subsequently developed a more
straightforward genotyping method as GBS with restriction enzymes of
ApeKI in maize and barley. The protocols for GBS are simple, extremely
specific and highly reproducible. In recent years, the easy
transferability of GBS to other species leads to many application in
plants ([48]Poland & Rife, 2012). One of the most attracting features
of GBS is the using of methylation-sensitive restriction enzymes during
libraries constructions to avoid repetitive fragments and to simplify
the reads alignments in extremely complex genomes ([49]Elshire et al.,
2011); therefore, GBS is an excellent whole-genome genotyping technique
for complex non-model organism genomes with massive repetitive regions
and abundant genetic diversities.
Teleost, representing a large portion of fish species, has been showed
to undergo the third round of whole-genome duplication (WGD) 370
million years ago ([50]Braasch et al., 2016; [51]Xiao et al., 2015b).
The extra WGD left a large portion of duplicated and repetitive
sequences in teleost genomes ([52]Berthelot et al., 2014; [53]Jaillon
et al., 2004), making the accurate whole-genome SNP marker development
was still challenging in many teleost species ([54]Wang et al., 2008).
We speculated that GBS technique provided an efficient way and was
suitable for genotyping in teleost complex genome. However, the
whole-genome SNP development and association studies based on GBS is
rarely reported on teleost fish species. Large yellow croaker
(Larimichthys crocea), belonging to the Sciaenidae family of teleost,
is an important marine fish in China and East Asia ([55]Xiao et al.,
2015a). Due to over-fishing and habitat degradation in last decades,
the wild stock of the species has rapidly collapsed ([56]Liu, Mitcheson
& Sadovy, 2008). The environmental changes and over-dense aquaculture
pose more challenges on population conservation and sustainable
development of the aquaculture for large yellow croaker. Whole-genome
molecular markers and genome-wide association studies (GWAS) for
important traits are prerequisites for the population conservation and
genomic selection of the species ([57]Steiner et al., 2013). However,
the association studies are rarely reported for large yellow croaker,
largely because of the lacking of abundant stable genomic SNP markers.
GBS technique provides the potential cost-efficient way for
whole-genome SNP marker development in complex teleost genome. In the
present investigation, we used large yellow croaker to verify the
applicability of GBS on teleost. Two restriction enzymes of EcoRI and
NlaIII based GBS protocol was developed and optimized. Massive
whole-genome SNP markers were developed from the sequencing reads by
bioinformatic pipelines, which were subsequently validated by
Sequenom MassARRAY assay. The detected SNP markers in this work were
then applied to the whole-genome association study of the muscle
Eicosapentaenoic Acid (EPA) and Docosahexaenoic Acid (DHA) content in
large yellow croaker. Our study confirmed the suitability of GBS on
whole-genome SNP marker development in teleost genome. The developed
whole-genome SNP markers and functional genes involved in muscle EPA
and DHA contents offered valuable genetic resources for conservation
genetics and genomic selection of large yellow croaker.
Materials and Methods
Ethics statement
The sample collection and experiments in the study was approved by the
Animal Care and Use Committee of Fisheries College of Jimei University
(Animal Ethics no. 1067).
Sample preparation and DNA extraction
The mixed reference population of 500 individuals was bred by a random
fertilization of 30 males and 30 females at the large yellow croaker
breeding base of Jimei University in Ningde, Fujian, China. All fish
individuals were 1.5 year old with the total length and weight of
24.5–25.9 cm and 217.8–234.1 g (95% confidence interval), respectively.
To extract genomic DNA respectively from 500 individuals, the dorsal
fins (20–30 mg) of the fish individuals were collected, frozen in
liquid nitrogen for the following DNA extraction. Total genomic DNA was
prepared in 1.5 ml microcentrifuge tubes containing 550 µl TE buffer
(100 mM NaCl, 10 mM Tris, pH 8, 25 mM EDTA, 0.5% SDS and proteinase K,
0.1 mg/ml). The samples were incubated at 55 °C overnight and
subsequently extracted twice using phenol and then phenol/chloroform
(1:1) method. DNA was precipitated by adding two and a half volumes of
ethanol, collected by brief centrifugation, washed twice with 70%
ethanol, air dried, re-dissolved in TE buffer (10 mM Tris–HCl, 1 mM
EDTA, pH 7.5). DNA concentration and quality were estimated with an
ND-1000 spectrophotometer (NanoDrop, Wilmington, DE, USA) and by
electrophoresis in 0.8% agarose gels with a lambda DNA standard.
In silico enzyme assessment for GBS library construction
To assess how endonucleases influence the DNA fragment length
distribution, four enzyme combinations were designed for in silico
digestion of large yellow croaker genome: ApeKI-PstI, EcoRI-BstNI,
EcoRI-NlaIII and PstI-NlaIII (NEB, Ipswich, MA, USA). The activation
and heat-inactivation temperature, location in chromosome and length
distribution of all fragments were analyzed to evaluation the best
performance of each enzyme combination.
GBS library construction and sequencing
The GBS libraries were constructed based on two DNA endonucleases:
EcoRI (NEB, Ipswich, MA, USA) and NlaIII (NEB, Ipswich, MA, USA). A
pilot GBS experiment was performed before the library construction to
optimize the temperature and time parameters for yield, size
distribution. Based on the pilot experiment, the GBS libraries of large
yellow croaker based on EcoRI and NlaIII were constructed following the
similar method in previous report ([58]Beissinger et al., 2013).
Briefly, genomic DNA (20 ng/µl) was incubated at 37°C with EcoRI and
NlaIII, 10XCutSmart™ Buffer. The restriction reactions were
heat-inactivated at 65 °C by 20 min and were kept in 8 °C for the
following experiments. Sequencing adaptor and barcode mix, T4 DNA
Ligase, 10 mM ATP and 10XCutSmart™ Buffer were incubated at 16 °C for
2 h for ligation reactions. The reactions were then heat-inactivated at
65 °C by 20 min and the reaction systems were kept in 8 °C. Then,
polymerase chain reactions (PCR) experiments were performed in the
reaction solutions (20 µL) containing the diluted restriction/ligation
samples (4 pM, 2 µL), dNTP (each at 10 mM, 5 µL), Taq DNA polymerase
(NEB, Ipswich, MA, USA) (5 units/µL, 0.25 µL), Illumina Primers (each
at 10 µM, 1 µL) and Indexing Primers (10 µM, 1 µL). The PCR procedure
was: 95 °C 2 min; 15 cycle of 95 °C 30 s 60 °C 30 s, 72 °C 30 s; 72 °C
5 min and kept in 4 °C. The PCR products were run on a 8%
polyacrylamide gel electrophoresis. Fragments of 200–300 bp were
isolated using QIAGEN QIAquick^® Gel Extraction Kit and diluted for
pair-end sequencing on an Illumina HiSeq 2500 sequencing platform
(Illumina, Inc, San Diego, CA, USA).
Sequencing read quality control and genotyping
The raw sequencing reads generated by Illumina HiSeq 2500 from the GBS
libraries were treated and cleaned for SNP detection. First, the
adaptors were removed and the resulted reads were split by
sample-specific barcode sequences. Only reads begins with the digest
site sequences of EcoRI and NlaIII were retained for the following
quality control. Second, the overall base and read quality were
accessed by FastQC. To avoid the negative influence of ambiguous bases
for SNP detection, reads with more than 5% of N were removed. Then, the
resulted reads were cleaned by the following steps: (1) discarding the
reads that the quality lower than 20; (2) deleting 5 bp windows in
reads end that the average quality smaller than 20; (3) removing read
pairs if one end was shorter than 50 bp.
The cleaned reads were mapped to large yellow croaker genome by BWA
0.7.6a ([59]Li & Durbin, 2009). The mapping was preceded by a short
reads alignment with BWA-MEM algorithm. The alignment were then sorted
by coordinates and duplicate marked by SortSam and MarkDuplicates
programs in Picard tools 1.107 (picard.sourceforge.net), respectively.
To reduce the false positives of SNP detection in this study, three
processes were carried out: (1) short read mapping were re-aligned by
local bases matches; (2) base Quality Score Recalibration (BQSR) was
employed to adjust the accuracy of the base and mapping quality scores;
(3) only reads pairs that both aligned on genome with a mapping score
higher than 30 were used for SNP calling. Then, the SNP markers were
detected by GATK UnifiedGenotyper utility.
SNP validation by Sequenom MassARRAY assay
Genomic DNA was extracted from dorsal fin ray tissue as the method
described before. PCR amplification was performed in the reaction
system (5 µl total volume) containing 20 ng of genomic DNA, 0.5U
HotstarTaq (Qiagen), 0.5 µl 10× PCR buffer, 0.1 µl dNTPs for each
nucleotide and 0.5 pmol of each primer. All PCR experiments were
carried out in a PTC-100 PCR instrument (Eppendorf) with the following
program: 4 min denaturation at 94 °C, 35 cycles of 20 s at 94 °C, 30 s
at 56 °C and 1 min at 72 °C and a final extension at 72 °C for 3 min.
After the PCR products were cleaned using 2 µl SAP (SEQUENOM), the
single base extension used 2 µl EXTEND Mix (SEQUENOM) contained 0.94 µl
Extend primer Mix, 0.041 µl iPLEX enzyme and 0.2 µl iPLEX termination
mix and performed with the following steps: initial denaturation at
94 °C for 30 s, followed by 40 cycles of 3-step amplification profile
of 5 s at 94 °C, additional 5 cycles of 5 s at 52 °C and 5 s at 80 °C
and a final extension at 72 °C for 3 min.The PCR product was cleaned by
resin purification and then analyzed using MassARRAY Analyzer Compac
(SEQUENOM) and software TYPER (SEQUENOM).
To evaluate the accuracy of SNP detection in this study, the genotypes
from GATK SNP calling were compared with those from MassARRAY assay. If
the genotypes of one SNP locus from GATK calling were identical with
that in MassARRAY, then the locus was called a correct genotype. As a
result, 1,421 of 1,500 SNP loci were correctly genotyped by GATK and
the success rate of SNP calling was ∼94.7%. The specificity and
sensitivity of SNP calling in the study were also evaluated. The
reference homozygous genotypes (AA) both from MassARRAY and GATK were
called true negatives, and the heterozygous genotypes or allelic
homozygous (AB and BB) both from MassARRAY and GATK were called true
positives. Specificity was then calculated as the number of true
positives divided by the number of true positives plus the number of
false positives, and the sensitivity was estimated as the number of
true positives divided by the number of true positives plus the number
of false negatives as the following formula:
[MATH: Specificity=true negativetrue
negative+false
positive :MATH]
[MATH: Sensitivity=true positivetrue
positive+false
negative. :MATH]
Association analysis with the muscle EPA and DHA content
From the 500 large yellow croaker population, 200 individuals were
randomly selected for the muscle EPA and DHA content measurement for
the following statistics and association analysis. The fat acid
composition analysis followed the similar methods in previous reports
([60]Murillo, Rao & Durant, 2014). Briefly, the total lipid was
extracted from the fresh muscle tissue using the chloroform-methanol
method ([61]Folch, Lees & Sloane-Stanley, 1957). After saponification
with 1 ml of 50% KOH in 15 ml ethanol, the lipid was then esterified in
80 °C for 20 min using 6.7% boron trifluoride (BF3) in methanol (Morita
chemical industries Co., Ltd., Osaka, Japan). After making up in hexane
(20 mg/ml), fatty acid methyl esters (FAME) preparations were analyzed
by gas chromatography (GC). The temperature increase of 170–260 °C at
2 °C/min was set and helium was used as the carrier gas. Since the
muscle contents of EPA and DHA were highly correlated, we combined
those two components together in fish muscle.
Prior to the association study, pairwise clustering based on the
alleles shared identical by state (IBS) between any two individuals was
performed to assess the genetic relatedness. The population
stratification was analyzed by multi dimensional scaling (MDS)
clustering methods available in the PLINK software ([62]Purcell et al.,
2007).
With the developed SNP markers, the association analysis was performed
between genotypes and measured muscle EPA and DHA content using Plink
1.07 A simple linear regression of phenotype on genotype was performed
in the analysis. Markers with p-values ≤ 1e–4 were considered
significantly associated with muscle EPA and DHA contents. To identify
the biological functions of nearby genes and whether the orthologs of
these significantly associated loci were also associated with the EPA
and DHA content in other species, we identified the protein-coding
genes around 50 kb of the significant SNP markers. We aligned the genes
against the NCBI nr database by Blastx ([63]Altschul et al., 1997). GO
term and KEGG pathway enrichment analysis of the associated genes were
performed with Gene set enrichment analysis (GSEA) ([64]Shi & Walker,
2007) by two-tailed Fish’s exact test with Benjamini & Hochberg false
discovery rate (FDR) ([65]Benjamini & Hochberg, 1995) against the
background of the all protein-coding gene in large yellow croaker
genome. The additive genetic variances were estimated by using
R-package EMMREML, Version 3.1.
([66]http://mirror.bjtu.edu.cn/cran/web/packages/EMMREML/index.html).
Results
Enzyme assessment and GBS construction for large yellow croaker
According to the principles of the enzyme combination design for GBS
library construction, four enzyme combinations were designed for the
GBS analysis of large yellow croaker genome: ApeKI-PstI, EcoRI-BstNI,
EcoRI-NlaIII and PstI-NlaIII (NEB, Ipswich, MA, USA). To assess the
fragment size distribution and the number of potential SNP marker
developed, the public large yellow croaker draft genome sequences
([67]Ao et al., 2015) were in silico digested by the four two-enzyme
combinations to mimic the genomic fragmentation. As shown in the
[68]Fig. 1, the predicted fragment numbers decreased with the fragment
size for all enzyme combinations, but the ApeKI-PstI and EcoRI-BstNI
lead to a large portion of fragments longer than 1 kb. According to the
size distribution in [69]Fig. 1 and to make the fragment size more
compatible to NGS sequencing, the genomic fragment with a size of
100–300 bp were preferred; therefore, EcoRI-NlaIII and PstI-NlaIII were
the rational combinations for the following library construction.
According to our in silico experiments, genomic fragment in the range
of 200–300 bp were used to construct GBS libraries (see ‘Method’ for
details). By the assessment of the combination of EcoRI-NlaIII, roughly
1.5 million fragments would be collected in libraries.
Figure 1. Fragment length distribution by restraint enzyme combination.
[70]Figure 1
[71]Open in a new tab
Note that all fragments longer than 1 kb were accumulated in the last
bar.
Library sequencing and reads mapping
GBS libraries were constructed by the two enzyme based digestion (see
‘Method’ for the details). The NGS sequencing of GBS libraries for 500
individuals generate roughly 314 Gb raw sequencing reads. To evaluate
the raw data distribution among samples, we found that the majority of
individuals (∼95%) had the raw sequencing reads ranged from 600 to 650
Mb, indicating the excellent sequencing uniformity among samples from
library construction and sequencing. The raw reads were cleaned by
HTSeq to trim low quality ends (average quality < 20) and eliminate
short reads (length < 50 bp). The cleaned reads were mapped to large
yellow croaker reference genome sequences ([72]Ao et al., 2015) by BWA
([73]Li & Durbin, 2009). To assess the quality of GBS library, the
mapped reads distribution of Sample 88 along the linkage groups were
illuminated as an example ([74]Fig. S1). We found that reads were
evenly covered all linkage groups of large yellow croaker, indicating
an ideal representativeness of the libraries at the whole genome level.
The covered loci depth distribution ([75]Fig. S2) showed that the
majority of depth ranged from 5 to 20 and extreme reads enrichment on
genome local regions were successfully avoided in sequencing libraries.
SNP discovery among samples in large yellow croaker genome
To develop molecular markers based on the GBS library sequencing, SNP
variants markers were detected from the reads alignments by GATK
([76]McKenna et al., 2010) pipelines (see ‘Method and Material’ for
detailed information). To improve the quality of the detected SNP, we
employed the extra reads local re-alignment and Base Quality Score
Recalibration (BQSR) steps in SNP calling pipelines. Previous
literatures on model organisms showed that those extra processes on
reads alignment and SNP quality could significant reduce the false
positives SNPs ([77]De Pristo et al., 2011; [78]McKenna et al., 2010;
[79]Van der Auwera et al., 2013), therefore our refined bioinformatics
pipeline coupled with the library construction provided a solid
foundation for SNP detection in this study. As a result, 489,246 SNP
markers were discovered in at least 200 large yellow croakers with a
loci depth threshold of 3. It is not surprised that the majority of SNP
markers were not shared by all samples because of the inherent DNA
polymorphisms on enzyme digestion site in genome. The number of shared
SNP markers among samples was crucial for the evaluation of the GBS
sequencing of large yellow croaker genome, especially for the studies
of QTL and GWAS analysis in populations. We further used depth- and
population-based method to investigate the influence of loci depth and
population size on the number of shared SNP marker. As we expected,
both the population size and loci depth dramatically influenced the
number of shared SNP markers ([80]Fig. 2). However, hundreds of
thousands of the shared SNP markers were identified with a depth
threshold of 5 in the study. According to previous literatures on SNP
development in non-model organisms, the depth filtering of 5 provided
high quality SNP markers for the genetic studies ([81]Hiremath et al.,
2012; [82]Nguyen, Hayes & Ingram, 2014); therefore, our SNP calling
based on GBS library developed sufficient SNP markers for the
biological trait mapping and conservation genetics. We indeed found the
sharp decreases on the number of shared SNPs for the population size
from 450 to 500, which could be attributed to the samples with
extremely low sequencing amount.
Figure 2. SNP number against sequencing depth and completeness.
[83]Figure 2
[84]Open in a new tab
Note that only SNP loci witha quality score larger than 100 were used
in the analysis.
To control SNP marker quality while maximizing the number of shared
samples and to facilitate the following GWAS analysis, markers with the
loci depth higher than 5 and the shared in at least 400 individuals
(90% of all sample) were used for the following analysis, resulting
69,845 SNP markers in large yellow croaker genome. To answer the
question if our sequencing data was sufficient for the whole-genome SNP
development, the numbers of the detected genomic SNP markers were
plotted against the sequencing data for each sample. As shown in
[85]Fig. 3, the number of the discovered SNP marker increased with
sequencing reads and remained to be ∼70,000 when the sequencing amount
reached 600 Mb, implying that 600 Mb might be an optimal data amount
for the trade-off the cost and SNP number in our large yellow croaker
GBS libraries. The distribution of those SNP markers in 24 linkage
groups ([86]Fig. 4) showed that those SNP markers were ideally evenly
distributed in the genome, suggesting an excellent representation of
whole genome markers in large yellow croaker. The location and
functions of SNPs were investigated by comparing the locus coordinates
with those of gene annotations. We found that ∼53 % of these SNPs were
from genic regions, including exons (3,000), introns (27,114), and
untranslated regions (UTRs, 9,166) ([87]Fig. 3). The detailed SNP
categories in UTR revealed that 4,311 and 4,855 SNPs were from 5UTR and
3UTR respectively. The biological functions of those SNP markers were
analyzed according to their relative positions of the protein-coding
genes. As in [88]Fig. 5, 866 SNPs markers in coding regions caused
synonymous mutations. Of the remaining markers, 3,022 SNPs could lead
to a change of amino acid and introduction of frame shift and new or
lost start/stop codons. Those SNP markers might significantly alter the
biological functions of the hosting genes and thus influence the
biological traits that controlled by those genes.
Figure 3. SNP number against sequencing amount.
[89]Figure 3
[90]Open in a new tab
The distribution of sequencingamount (top) and SNP marker number
(right) were plot by sides. The line in the scatter is the smoothed
curve cross all samples, and the grey area represent the 95% of the
confidence region.
Figure 4. SNP distribution along chromosome.
[91]Figure 4
[92]Open in a new tab
The lines along the chromosomes represent SNP loci. The SNP location in
exon, intron, UTR and intergenic regions are showed by red, blue,
purple and green, respectively.
Figure 5. Biological impact annotations of high quality SNP markers that
shared by at least 80% of the population with 500 large yellow croakers.
[93]Figure 5
[94]Open in a new tab
Experimental validation of detected SNP loci
To assess the reliability of the SNP makers developed from the reduced
representation libraries, 50 loci from 30 individuals were randomly
selected to validate the marker polymorphism by the Sequenom MassARRAY
assay. As shown in [95]Table 1, MassARRAY assay verified the most of
detected SNP markers in those samples. Among 1,500 markers, 1,421 were
validated by MassARRAY, confirming our library construction, sequencing
and SNP marker calling pipelines. The primers for SNP validation and
the detailed genotypes were listed in [96]Tables S1 and [97]S2,
respectively. As shown in the [98]Table 1, the specificity and
sensitivity for the SNP genotype detection in the present study were
estimated as 94.2% and 98.3%, respectively. Notably, we found that the
majority of discordant genotypes were heterozygous, which was
consistent with the reports for other organisms ([99]Sonah et al.,
2013). We attributed the error-prone genotypes in heterozygous markers
to the fact that those markers need more supporting reads than their
homozygous counterparts. However, the Sequenom MassARRAY assay still
successfully validated ∼95% of the detected SNP marker developed by the
GBS library sequencing, providing us solid SNP genotypes of the
following trait association and other genetics studies for large yellow
croaker.
Table 1. SNP validation by Sequenom MassARRAY assay.
NNs indicate the failed genotypes during the SNP filtering.
Genotypes Sequenom MassARRAY assay
AA AB BB NN
SNP calling AA 901 0 2 0
AB 54 404 7 0
BB 0 2 116 0
NN 11 1 2 0
[100]Open in a new tab
The association study with the muscle EPA and DHA content
To apply the genome-wide markers to probe potential marker and genes
contributing to muscle EPA and DHA contents, 200 large yellow croakers
reared with the identical feed in the same netcage were used to
quantify EPA and DHA level. Muscle EPA and DHA contents in 176
individuals were successfully extracted and measured. The contents
exhibited a typical normal-like distribution (p-value of 0.94 with
Kolmogorov–Smirnov test) with an average of 21.5 mg/g and a standard
deviation of 4.1 mg/g ([101]Fig. S3). The difference of the highest and
the lowest EPA and DHA contents was ∼13.8 mg/g.
Before the association study, population stratification with MDS
clustering showed that all samples were grouped into several clusters
([102]Fig. S4). 200 individuals used in the association study were
distributed among population clusters, providing diversified genetic
background for the following studies. From the MDS plot ([103]Fig. S4),
the first and second dimension value (arbitrary value) were all smaller
than 0.05, indicating homogeneous genotypes among samples. The
association study of SNP marker with the muscle EPA and DHA content was
performed with the linear model with a covariance to sex in Plink
([104]Purcell et al., 2007). 69,845 SNP loci developed above with depth
threshold of 5 were used to perform the association study ([105]Fig.
5). As shown in [106]Fig. 6, 39 markers from 11 linkage groups were
exhibited significant association with the EPA and DHA content (p-value
< 1e–4). Notably, many associated markers were significant by clusters
in linkage group 4, 5 and 11, suggesting the credibility of the
association studies. The results might also imply that many genes
contributed to the muscle EPA and DHA levels in large yellow croaker.
With the variance estimation by Restricted Maximum Likelihood (REML)
method ([107]Smith & Graser, 1986), we found that those 39 significant
markers could interpret as high up to ∼63.0% of genetic variance
explained by all 69,845 markers.
Figure 6. GWAS analysis on the muscle EPA and DHA content and the functional
analysis therelated protein-coding genes.
[108]Figure 6
[109]Open in a new tab
(A) The association results were illuminated in the Manhattan plot. The
red line is the p-value threshold for significant markers; (B) KEGG
pathway enrichment of functional genes; (C) GO term enrichment of
related biological functions for the associated genes.
To identify gene contributing to the muscle EPA and DHA content in
large yellow croaker, we investigated the biological functions of
protein-coding genes within 50 kb of all significant SNP markers
(p-value < 1e–4). As a result, 122 genes were identified from the
above association regions. The biological KEGG pathway and GO term
annotations of the associated genes were enriched under the background
of all protein-coding genes. The metabolic pathway of fat digestion and
absorption was significant (FDR < 0.023) in the KEGG enrichment
([110]Fig. 6B and [111]Table S3). Meanwhile, GO terms of unsaturated
fatty acid biosynthetic process, fatty acid derivative biosynthetic
process and lipid transporter activity were also highlighted (FDR
< 0.05) for the associated functional genes ([112]Fig. 6C). The
detailed gene function GO annotations were summarized in [113]Table S4.
We found that the many identified genes played important roles in lipid
transport, metabolism and transcription regulation, such as
apolipoprotein B (APOB), Carnitine O-acetyltrasferase (CRAT) and
oxysterol binding protein 10 (OSBPL10). APOB is a crucial lipid
transport protein in organism. Previous nutriology studies confirmed
the correlation of EPA and DHA contents with APOB genotypes and gene
expression ([114]Anil, 2007). Given their close relationship, we
speculated that the polymorphisms on APOB gene might contribute to the
EPA and DAH accumulation in large yellow croaker muscle. CRAT and
OSBPL10 may also involved in the muscle EPA and DHA content since
carnitine and oxysterol were important components and regulators in EPA
and DHA synthesis pathways according to previous reports ([115]Qiu,
2003; [116]Rise, Marangoni & Galli, 2002). Notably, we observed
palmitoyl-protein thioesteraes 2 (PPT2) (around a maker with a p-value
of 6.7e−06) as a potential functional gene contributing to muscle EPA
and DHA contents. PPT2 gene was also identified by genome-wide
association study on n-3 and n-6 polyunsaturated fatty acid levels in
Chinese and European-ancestry populations ([117]Dorajoo et al., 2015;
[118]Hu et al., 2016).
Discussions
The advent and development of NGS have unprecedentedly prompted the
application of the whole-genome marker development ([119]Seeb et al.,
2011). Recently, SNP developments on genomic level were performed in
many species, including livestock and fish in agriculture ([120]Sun et
al., 2014). However, the cost for whole-genome re-sequencing is still
one of the largest challenges in genomic marker developments. Based on
NGS, GBS generally used multiple endonucleases to obtain the desired
genomic length and the number of fragments to squeeze the sequencing
cost ([121]De Donato et al., 2013; [122]Elshire et al., 2011;
[123]Sonah et al., 2013), thus improving the specificity of marker
detection along individuals; however, most of the GBS application were
reported for plant genomes. Teleost, representing a large portion of
fish species, has been showed to undergo the additional third round of
whole-genome duplication (WGD) 370 million years ago. The extra genome
duplication led to a large portion of duplicated and repetitive
sequences in teleost genomes ([124]Sémon & Wolfe, 2007). GBS techniques
provided an efficient way to probe polymorphism markers from complex
genomes; however, the whole-genome SNP development and association
studies based on GBS is rarely reported on teleost fish species. In
this work, we used teleost large yellow croaker to verify the
applicability of GBS on genomic marker development on teleost species.
So far as we know, this is the first GBS implementation in large yellow
croaker genome. The developed SNP markers provided useful resources for
the following genetic studies, including population structure,
conservation and functional gene mapping of important traits of the
species. The enzyme combination and GBS protocols used in this study
could also be valuable reference for other teleost species.
Our in silico experiments mimicked the two enzyme digestion on large
yellow croaker genome. After the detail investigation of the enzymes,
we chose the combination of EcoRI and NlaIII for GBS protocol for two
reasons. Firstly, EcoRI and NlaIII possessed the identical
heat-inactivation temperature, which facilitated the pilot studies to
optimize the experimental conditions for library construction; secondly
and more importantly, EcoRI was sensitive when restriction site
overlaps methylation sequence of CpG islands, therefore the using of
the enzyme would partially avoid the digestion in repetitive regions.
Many previous GBS libraries were constructed with the fragment length
100–300 bp or even wider ([125]Elshire et al., 2011; [126]Sonah et al.,
2013); however, we predicted ∼3 million fragment would be generated in
that range. The large number of fragment implied a large amount of
sequencing reads to cover those genomic regions, which would increase
the unit-cost for the sequencing. To reduce the genomic fragments
needed to be sequenced for libraries in this study, we attempted to
narrow the length range to 200–300 bp, which was predicted to generate
roughly 1.5 million genomic fragment for sequencing.
Taking the SNP frequency of 1 per 1,000 bp ([127]Pushkarev, Neff &
Quake, 2009), the library sequencing might result into roughly 300
thousand SNP markers along large yellow croaker genome. Our estimation
was based on the assumption that all individuals have no mutation on
endonuclease digesting site and the read depth were high enough to
cover SNP loci. However, because of the divergent genomic background
among populations, it is very hard to detect all SNP markers that
shared by all individuals. In this work, 489,246 raw SNP markers
supported by more than three reads were detected with an average
sequencing amount of 600 Mb in at least 200 individual from the 500
large yellow croaker population. To facilitate the following marker
association study and breeding practise, previous studies proposed
several methods to filter the high quality SNP that shared in more
individuals, such as depth-based ([128]Li et al., 2009), quality
score-based ([129]Brockman et al., 2008) and population-based
([130]Bansal et al., 2010) manner. In this study, we employed a
composite strategy for SNP filtering by simultaneously considering loci
depth, marker quality and shared population size. As a result, 69,845
SNP markers were left with a depth higher than 5, quality score higher
than 100 and shared with at least 80% individuals (400 large yellow
croakers). More than half (∼53%) of those detected quality SNP markers
resided in genic regions, enabling us to probe the possible association
of trait with the nearby functional genes. We noticed that the
percentage of markers in genic regions was higher than that of previous
reports in soybean (39.5%) ([131]Sonah et al., 2013) but lower than
that of sweat cherry (65.6%) ([132]Guajardo et al., 2015). Those SNP
markers generated from the reduced representation library, especially
those in genic regions, provided us an easy and efficient manner to
detect genomic small variants and to identify genomic regions related
to important traits of large yellow croaker at genomic scale. The
detected SNP markers were then validated by Sequenom MassARRAY assay
for the randomly selected 50 loci in 30 individuals. Although the
success rate (94.6%) was slight lower than that reported in the similar
study in soybean (98%) ([133]Sonah et al., 2013), the library
preparation protocol and bioinformatics pipeline provided us high
quality genotypes on those SNP loci in the population for the following
association studies.
The successful applications of GWAS have greatly prompted the
understanding to the genetic bases of important economic traits and
would eventually benefit the artificial breeding and population
conservation of non-model species ([134]Correa et al., 2015; [135]Narum
et al., 2013) EPA and DHA are both omega-3 poly-unsaturated fatty acids
that important in human physiology ([136]Swanson, Block & Mousa, 2012).
Previous medical experiments demonstrated their positive effects on
depressive symptoms in clinical trials ([137]Hoffmire et al., 2012) and
the essential functions in brain development ([138]Brenna & Carlson,
2014). Marine fish is a main source for human EPA and DHA supplement
and nutritional properties of fish meat are highly dependent on
polyunsaturated fatty acid levels; therefore the EPA and DHA content in
muscle is one of the important indicants for the meat quality of fish.
The genetic bases controlling EPA and DHA accumulation in fish species
are highly interconnected and not fully revealed. Identifying key SNP
loci and functional genes will increase our knowledge of molecular
mechanism of polyunsaturated fatty acid synthesis and metabolism in
marine fish. To the best of our knowledge, most of the previous
researches were focus on the genetic variants on poly unsaturated fatty
acid metabolism after fish oil supplements in human or gene expression
and EPA and DHA level changes with different feed in fish ([139]Gregory
et al., 2016; [140]Li et al., 2014; [141]Li et al., 2013;
[142]Trushenski et al., 2012). The association studies aiming to
identity potential functional genes contributing to EPA and DHA
accumulation in fish meat is rarely reported.
Among 176 individuals that were used to measure the muscle EPA and DHA
level, the average muscle EPA and DHA content in the top 20 large
yellow croakers (28.4 mg/g) was almost two-fold of that in the lowest
20 ones (14.6 mg/g). Given that those fish were reared in the same cage
and fed with the identical feed, there was a great potential to raise
the muscle EPA and DHA content in large yellow croaker via genetic
improvements. Using the developed quality SNP markers by GBS protocol,
39 SNP markers from 11 linkage groups were observed to be significantly
associated with muscle EPA and DHA levels. From the coordinates of gene
and SNP loci, 122 protein-coding genes were identified around those
significant markers. The functional analysis by homological searching
found that many genes were involved in fat metabolism and transport,
such as APOB, CRAT and OSBPL10. Unsaturated fatty acid biosynthetic
process, fatty acid derivative biosynthetic process and lipid
transporter activity and fat digestion and absorption pathway were
significantly enriched in GO terms and KEGG pathways for the identified
functional genes. Meanwhile, we observed large numbers of genes
functions in cellular metabolism, gene expression and translation
regulation, which may also play a role in modulating muscle EPA and DHA
contents ([143]Tables S2 and [144]S3). Interestingly, we identified the
potential functional gene of PPT2 gene in large yellow croaker that was
previously discovered during the whole-genome association of plasma n-3
and n-6 polyunsaturated fatty acid level in Asian and European
populations ([145]Hu et al., 2016). The PPT2 gene in the linkage group
5 of large yellow croaker might play a similar function in human and
also contribute to the muscle EPA and DHA level. This result suggested
that teleost fish and human may shared similar metabolic pathway for
the polyunsaturated fatty acid synthesis and accumulation; however
biological functions of PPT2 gene for the muscle EPA and DHA content in
large yellow croaker and other vertebrates need further gene functional
analysis.
Conclusions
Teleost were widely believed to undergo the third round of WGD during
the natural evolution; therefore, genomes of many teleost species were
characterized by the complexity of high heterozygosity and repeat
contents. In this work, EcoRI-NlaIII based GBS protocol was used to
develop the whole-genome SNP markers in teleost large yellow croaker.
The study verified the applicability of GBS on teleost species and
provided useful references for GBS applications in other fish species.