Abstract

   The domestication of a wild-caught aquatic animal is an evolutionary
   process, which results in genetic discrimination at the genomic level
   in response to strong artificial selection. Although black tiger shrimp
   (Penaeus monodon) is one of the most commercially important aquaculture
   species, a systematic assessment of genetic divergence and structure of
   wild-caught and domesticated broodstock populations of the species is
   yet to be documented. Therefore, we used skim sequencing (SkimSeq)
   based genotyping approach to investigate the genetic structure of 50
   broodstock individuals of P. monodon species, collected from five
   sampling sites (n = 10 in each site) across their distribution in
   Indo-Pacific regions. The wild-caught P. monodon broodstock population
   were collected from Malaysia (MS) and Japan (MJ), while domesticated
   broodstock populations were collected from Madagascar (MMD), Hawaii,
   HI, USA (MMO), and Thailand (MT). After various filtering process, a
   total of 194,259 single nucleotide polymorphism (SNP) loci were
   identified, in which 4983 SNP loci were identified as putatively
   adaptive by the pcadapt approach. In both datasets, pairwise F[ST]
   estimates high genetic divergence between wild and domesticated
   broodstock populations. Consistently, different spatial clustering
   analyses in both datasets categorized divergent genetic structure into
   two clusters: (1) wild-caught populations (MS and MJ), and (2)
   domesticated populations (MMD, MMO and MT). Among 4983 putatively
   adaptive SNP loci, only 50 loci were observed to be in the coding
   region. The gene ontology (GO) and Kyoto Encyclopedia of Genes and
   Genomes (KEGG) analyses suggested that non-synonymous mutated genes
   might be associated with the energy production, metabolic functions,
   respiration regulation and developmental rates, which likely act to
   promote adaptation to the strong artificial selection during the
   domestication process. This study has demonstrated the applicability of
   SkimSeq in a highly duplicated genome of P. monodon specifically,
   across a range of genetic backgrounds and geographical distributions,
   and would be useful for future genetic improvement program of this
   species in aquaculture.

   Keywords: low coverage sequencing, population structure, genetic
   improvement, outlier approach, domesticated population, Penaeus monodon

1. Introduction

   The world production of inland aquaculture reached 51.3 million tonnes
   in 2018, with their dominant production of 97.2% comprising of finfish,
   while mariculture produced 30.8 million tonnes, with 56.2% represented
   by mollusc [[46]1]. Marine shrimp industries yielded almost 4 million
   tonnes in the same year, playing the role as the major supply of
   shrimps in the global market [[47]2]. While disease outbreak and price
   fluctuations in the global trade have highly impacted the production
   and socio-economic development in many countries, the surging prices
   and shortage of good quality shrimp broodstocks have further impeded
   the shrimp industry [[48]1,[49]3]. In fact, shrimp farming is heavily
   dependent on wild-caught broodstocks or domesticated broodstocks with
   reduced quality, due to repeated spawning programs [[50]4,[51]5,[52]6].
   To ensure sustainable seed supply, standardized benchmarks for quality
   and quantity assessment of broodstocks are essential [[53]7].

   Population structure is the organization of genetic diversity, and it
   is influenced by multiple evolutionary process, such as genetic drift,
   mutation, gene flow, natural selection and demographic history
   [[54]8,[55]9]. It is estimated by parameters like genetic
   differentiation, variant alleles frequencies, population size and
   population dynamic [[56]10,[57]11,[58]12,[59]13]. Low number of
   breeders and inadvertent or mass selection of prospective shrimp
   broodstocks can result in rapid reduction of genetic diversity
   [[60]14,[61]15,[62]16,[63]17,[64]18,[65]19]. The loss of genetic
   variation, reduction in effective population size and accumulation of
   inbreeding effects over several generations of selective breeding can
   compromise the effectiveness of genetic improvement programs
   [[66]20,[67]21,[68]22]. Although the magnitude of inbreeding effects
   are variable between different traits, genetic structure of a
   population and environmental interactions, the overall mean phenotypic
   value of traits associated with reproductive fitness and physiological
   efficiency is often reduced [[69]23,[70]24,[71]25]. Therefore, adequate
   levels of genetic variation are vital for the maintenance of the gene
   pool in cultured shrimp with minimal effects of inbreeding depression
   and enhanced capacity in responding to environmental changes
   [[72]26,[73]27]. Long-term crossbreeding programs may help to achieve a
   balance between continuous genetic gains and reduced risk of inbreeding
   depression [[74]6,[75]28].

   Penaeus monodon, commonly known as black tiger shrimp is widely
   distributed across the Indo-Pacific region, and is one of the most
   commercially important aquaculture species [[76]29,[77]30]. To date,
   population structure studies of P. monodon have only been limited to
   several traditional markers such as mtDNA RFLP [[78]29], allozymes
   [[79]31], microsatellite [[80]32,[81]33] and Sanger sequencing of
   various genes including mtCR (mtDNA control region) [[82]34,[83]35],
   mitochondria DNA [[84]36] and elongation factor-1a [[85]37], while next
   generation sequencing (NGS) approaches are yet to be explored for this
   species in this aspect. In fact, NGS have been well proven in revealing
   fine-scale population structure and phylogeographic divergence in
   numerous aquatic species
   [[86]38,[87]39,[88]40,[89]41,[90]42,[91]43,[92]44].

   Skim sequencing (SkimSeq) is one of the less complex NGS methods, which
   uses low coverage (1–10X) whole genome sequencing of multiple
   individuals for high resolution genotyping
   [[93]45,[94]46,[95]47,[96]48]. SkimSeq is less laborious with fewer
   complex steps, is unbiased towards specific alleles, is capable of SNPs
   detection, which enables informative sampling and validation of the
   genome [[97]49,[98]50,[99]51,[100]52,[101]53]. This genotyping by
   sequencing approach is useful in population study with unknown parental
   genome information to generate detailed diversity analysis and
   marker-assisted selections [[102]45,[103]53].

   Given that the SkimSeq is known as low coverage genome sequencing
   approach [[104]38], and so far have only been applied in plant genomics
   research [[105]45,[106]46,[107]47,[108]48], we have chosen this
   technique to generate high resolution sequence dataset, as reported in
   previous studies, for the first aquatic invertebrate with small samples
   size. To explore the genetic diversity, population structure and
   discover the novel molecular markers of P. monodon from different
   origins (wild vs. domesticated), we genetically assayed 50 individuals
   with the SkimSeq approach using the short read Illumina sequencing
   platform. This study aimed to: (i) investigate the genetic structure of
   wild and domesticated populations, (ii) quantify the genetic
   differentiation between populations and (iii) examine the presence of
   putative loci causing the genetic variation. The genomic analyses and
   genetic resource acquired from this study will be useful to support
   future genetic improvements in P. monodon culture and brood-stock
   selection activities.

2. Materials and Methods

2.1. Sample Collection

   A total of 50 P. monodon individuals consisting of both domesticated
   and wild-caught broodstocks were collected from five sampling sites (10
   individuals from each site and of same family) across Indo-Pacific
   regions ([109]Figure 1), and the sampling details are listed in
   [110]Table 1. Among the five locations, P. monodon broodstock
   populations from MMD (Madagascar), MT (Thailand) and MMO (Hawaii, HI,
   USA) were obtained from domesticated shrimp farms, while MS (Malaysia)
   and MJ (Japan) were caught from the wild. The muscle tissues were
   extracted from broodstock individuals and preserved in 99.5% v/v
   ethanol and stored at −20 °C until DNA extraction. All samples were
   collected in accordance with the animal care and tissue collection
   protocol as approved by the Universiti Malaysia Terengganu’s Animal
   Care and Biosafety Committee.

Figure 1.

   [111]Figure 1
   [112]Open in a new tab

   Sampling sites of five broodstock populations of Penaeus monodon in the
   Indo-Pacific region. MMD (Madagascar), MT (Thailand) and MMO (Hawaii,
   HI, USA) represent domesticated populations, while MS (Malaysia) and MJ
   (Japan) are wild-caught populations. Values in parentheses denote the
   sample size for each population.

Table 1.

   Summary of the sampling information of Penaeus monodon population
   collected from the different Indo-Pacific regions.
   Locations Location Abbreviation Broodstock Source Latitudes Longitudes
   Year
   Mahajamba, Madagascar MMD Domesticated 16°02′52.8″ 47°11′38.0″ 2018
   Hawaii, HI, USA MMO Domesticated 19°42′55.9″ 156°02′34.6″ 2018
   Petchaburi Province, Thailand MT Domesticated 12°58′06.5″ 99°37′48.0″
   2019
   Setiu Wetland, Malaysia MS Wild 5°40′38.3″ 102°42′36.8″ 2019
   Shizuoka, Japan MJ Wild 34°56′25.9″ 138°02′17.9″ 2018
   [113]Open in a new tab

2.2. DNA Extraction and Library Preparation

   Genomic DNA of P. monodon broodstocks were isolated from the muscle
   tissues using Wizard Genomic DNA Purification Kit following the
   manufacturer’s protocols (Promega, San Luis Obispo, CA, USA). The
   concentration and purity of the extracted genomic DNA were quantified
   based on A260/280 nm ratio using BioDrop (BioDrop, Cambridge, UK). DNA
   quantifications were conducted using real-time PCR fluorescence
   measurements of double stranded DNA [[114]54] and the Quant-it kit
   (Life Technologies, Foster City, CA, USA). Genomic DNA was fragmented
   into the insert size of 350 bp using with TruSeq^® DNA PCR-Free Library
   Prep Kit (Illumina Inc., San Diego, CA, USA) and Covaris M220 (Covaris
   Inc., Woburn, MA, USA), following the kits’ protocol at Monash
   University Malaysia Genomics Facility (Selangor, Malaysia). Accurate
   quantification and quality checking of the DNA libraries were conducted
   using the KAPA Library Quantification Kit (Roche Sequencing and Life
   Science, Indianapolis, IN, USA), while qualitative estimation of the
   libraries was performed by Agilent Technologies 2100 Bioanalyzer
   (Agilent Technologies Inc., Santa Clara, CA, USA). Detailed QC
   parameters, including library cluster density, library complexity,
   percent duplication, GC bias, and index representation were generated
   on the MiSeq system (Illumina Inc., Foster City, CA, USA) to ensure a
   uniform concentration of all samples prior to SkimSeq. The pooled
   library was denatured based on the Illumina NextSeq denaturation
   guideline and paired-end sequencing was carried out using NextSeq
   500/550 High Output v2 300 cycles kit on a NextSeq 500 (Illumina Inc.,
   Foster City, CA, USA).

2.3. Sequence Assembly, Filtering and SNPs Discovery

   From the Illumina platform, 151 base-pair paired-end (2 × 151)
   sequencing reads were obtained in FASTQ format. Sequencing reads
   matching to PhiX DNA sequences were first removed by aligning reads
   against the PhiX DNA sequence using Bowtie 2 software version 2.2.3
   ([115]http://bowtie.cbcb.umd.edu) [[116]55,[117]56]. The cleaned reads
   were then subjected to Illumina sequencing adapter trimming and base
   quality (Q ≥ 20) trimming using PEAT software version 1.2.4
   ([118]http://jhhung.github.io/PEAT) [[119]57], to ensure that only good
   quality bases derived from the sample were further analyzed. Trimmed
   reads of less than 36 bp were also discarded using Trimmomatic software
   version 0.36
   ([120]http://www.usadellab.org/cms/index.php?page=trimmomatic)
   [[121]58]. The quality trimming and filtering analyses of the whole
   genome SkimSeq data revealed that each of the 50 individuals of P.
   monodon had about 30% low-quality reads, which were discarded before de
   novo genome assembly ([122]Table S1). In the present study, the good
   quality reads of all samples were assembled collectively using
   SOAPdenovo software version 2-r240
   ([123]ftp://public.genomics.org.cn/BGI/SOAPdenovo2) [[124]59] into a
   set of scaffolds. All samples (except MJ10) had greater than 91%
   mapping rate to the assembled scaffold, indicating that the assembled
   scaffold is able to comprehensively represent all samples ([125]Table
   S2). The calculated coverage of SkimSeq data in different samples was
   found to vary from 1.3 to 1.8X. Subsequently, BLAST+ version 2.2.31
   ([126]ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST)
   [[127]60] was used to identify the coding protein sequences from the
   order of Decapoda and the full SwissProt database [[128]61] to the
   assembled scaffold sequences. The good quality sequencing reads were
   aligned to the assembled scaffold sequences using BWA version
   0.7.12-r1039 ([129]http://maq.sourceforge.net) [[130]62]. The reads
   alignments were sorted, and potential PCR duplicate reads were
   identified and marked using Picard Tools version 2.9.0. The reads
   alignments were analyzed to identify variants using FreeBayes version
   1.2.0-2-g29c4002 [[131]63] to find small polymorphisms, specifically
   SNPs (single-nucleotide polymorphisms), indels (insertions and
   deletions), MNPs (multi-nucleotide polymorphisms) and complex events
   (composite insertion and substitution events) smaller than the length
   of a short-read sequencing alignment. The outputs includes a total
   number of 17,226,908 raw variants loci in the VCF file format. The
   identified variants coordinates were compared to the gene coordinates
   obtained from genome annotation using vcfanno software version 0.2.9
   ([132]https://github.com/brentp/vcfanno) [[133]64]. If a variant fell
   within the region for a protein-genome match, the information of the
   protein name was transferred to the variants as an annotation.

   SNP profiles were analyzed and visualized using SNPRelate version
   1.16.0 [[134]65], to exclude outlier samples with inconsistent genetic
   profiles. Likewise, samples containing a completeness of data of less
   than 80% among the remaining loci, a minimum quality score with minor
   allele frequency below threshold, and a mean depth per genotype lower
   than 20 were removed from the dataset. Out of the 50 P. monodon, 5
   individual sequences (MJ10, MS9, MMO1, MMO7 and MMO16) were removed
   from the dataset due to the inconsistent SNP profiles and/or greater
   than 20% missing genotypes, while the remaining 45 samples were used
   for all downstream analyses. Furthermore, variants filtering was
   conducted using VCFtools software version 0.1.16
   ([135]https://vcftools.github.io/) [[136]66] and BCFtools software
   version 1.9 ([137]http://samtools.github.io/bcftools/bcftools.html). At
   first, the variants filtering steps included the removal of complex
   indels, SNPs with more than two alleles, and composite insertion and
   substitution events. Further filtering steps included the removal of
   sites with less than 5% overall minor allele frequency, missing
   genotypes in >90% of the samples in any population, and SNP sites with
   genotypes not in Hardy-Weinberg equilibrium in any population (PHWE <
   0.001). After all filtering steps, a total of 194,259 individual SNP
   loci remained in the dataset. To detect putatively adaptive SNP loci
   among different wild and domesticated broodstock populations of P.
   monodon, we identified outlier SNPs from the 194,259 filtered
   individual SNP loci using pcadapt version 3.0.2 [[138]67,[139]68]. The
   “pcadapt” approach performs a principal component analysis and computes
   the p-values of each locus to detect adaptive loci. Default parameters
   were used for pcadapt analysis, and the “number_of_samples” parameter
   was set to 5 (a number equal to the sampled collections). The false
   discovery rate (FDR) threshold value set to 0.05, to control the false
   positive. Finally, an overall SNP loci dataset and putatively adaptive
   SNP loci dataset were created and used for all down-stream spatial
   clustering analyses.

2.4. Power Analysis

   A power analysis using POWSIM v. 4.1 [[140]69] was carried out to
   determine the power of all SNP loci and putatively adaptive SNP loci
   datasets derived from the SkimSeq approach. This program evaluates the
   statistical power of genetic homogeneity in individual species and
   allows the user to adjust a number of user-defined parameters. To
   calculate the power of our sampling design, the number of
   subpopulations was set to five (equals to the number of our collection
   sites), with 10 samples per subpopulation (number of the collected
   individuals per sampling site) for both all of the SNP loci and the
   putatively adaptive SNP loci datasets. The effective population size of
   the subpopulations were set to 1000, 2000 and 3000, and generation time
   (t) was adjusted to assess power at multiple F[ST] values (10 and 20
   generations), as described previously [[141]39]. As F[ST] in POWSIM
   assumes the independence of the subpopulations, power was expressed as
   the proportion of significant outcomes for 1000 alterations per batch
   and a statistically significant test (p < 0.05).

2.5. Genetic Variation Analysis

   The filtered data was imported as a genind object into R and
   down-stream spatial clustering analyses were largely conducted using
   the adegenet v2.0.1 R package
   ([142]http://cran.r-project.org/mirrors.html) [[143]70]. The GenoDive
   version 3.0 (Universiteit van Amsterdam, Amsterdam, The Netherlands)
   was used to conduct an analysis of molecular variance (AMOVA) on both
   datasets [[144]71]. The GenoDive was also used for significance testing
   of pairwise F[ST] to determine the genetic differences between
   collection sites for the all SNP loci and outlier datasets using the
   default settings, with the samples grouped by collection sites.
   Clustering analysis, discriminant analysis of the principal component
   (DAPC), was conducted using the adegenet R package (Universite’ de
   Lyon, UMR 5558, Lyon, France) on all SNP loci dataset and outlier
   datasets. Neighbor-joining trees were generated using all the SNP loci
   dataset and outlier datasets using Nei’s genetic distance method. The
   Bayesian clustering method, implemented in the STRUCTURE software v.
   2.3.4 (Stanford University, Stanford, CA, USA), was used to genetically
   assign individuals to clusters [[145]72]. Simulations were run for
   100,000 steps, following a burn-in period of 100,000 steps, considering
   values of K (number of clusters) from one to 15, with 10 replications
   for each value of K. The analysis was performed using an admixture,
   correlated allele frequencies, and no prior information on the sampling
   location or morphological species. For each individual, the program
   identifies the fraction of the genome that belongs to each one of the
   clusters. The rate of change in the log likelihood between successive K
   values was also estimated [[146]73]. The calculations were performed
   using STRUCTURE HARVESTER [[147]74]. The clusters of the estimated
   population structure were visualized using CLUMPAK [[148]75].

2.6. GO and KEGG Enrichment Analysis of Putatively Adaptive SNP Loci

   To identify the genes encoded within the adaptive SNP loci, a homology
   search program was applied for each putatively adaptive SNP locus with
   the genome sequences available for the L. vannamei in the NCBI
   database. For each detected 4983 SNP loci, flanking region of upstream
   and downstream 100 bp was extracted from the reference nucleotide
   sequence. Homology search between this extracted sequence set and CDS
   amino acid sequences of L. vannamei was performed under the homology
   threshold of e-value = 1 × 10^−5. Homology search program of the SNP
   flanking sequences showed that the total of 50 coding regions
   containing nonsynonymous mutations were encoded by the putatively
   adaptive SNP loci. To know the functional distribution of the genes
   encoded within the adaptive SNP loci, gene ontology (GO) and Kyoto
   Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis
   were performed using KOBAS 3.0 software
   ([149]http://kobas.cbi.pku.edu.cn/kobas3) and visualized with R
   [[150]76]. The hypergeometric test was used to identify the significant
   GO and KEGG pathways (p < 0.05).

2.7. Data Accessibility

   Our raw data for 50 individuals were submitted to the DDBJ
   ([151]https://ddbj.nig.ac.jp), with the DRA accession number:
   DRA010601. The data will be publicly available from 10 September 2020.

3. Results

3.1. Genome Assembly, Annotation and Quality Filtering of SNP Loci

   The de novo assembley of good quality data generated a total of
   6,425,442 scaffolds with the largest and smallest scaffold size of
   15,300 bp and 100 bp, respectively, and totaling 1,531,786,734 base
   pairs in length ([152]Table 2). Genome annotation analyses displayed
   that 440,707 scaffolds matched with at least one protein sequence, and
   107,063 Decapoda and 42,351 SwissPort protein sequences were matched to
   the scaffold ([153]Table 2). A total of 17,226,908 variant sites were
   identified across 3,651,235 scaffolds. Among 17,226,908 variant sites,
   1,347,070 were annotated with sequence similarity to Decapoda or
   SwissProt protein sequences ([154]Table 2).

Table 2.

   Summary of the genome assembly, genome annotation, variants
   identifications and variant annotation results obtained from the skim
   sequencing (SkimSeq) data of Penaeus monodon broodstock population
   collected from different Indo-Pacific regions.
   Genome Assembly Statistics
   Number of scaffolds 6,425,442
   Size of largest scaffolds (bp) 15,300
   Size of smallest scaffolds (bp) 100
   Total scaffold size (bp) 1,531,786,734
   Scaffold N50 306
   Genome annotation statistics
   Number of Scaffolds with at least one protein sequence match 440,707
   Number of Decapoda (order) protein sequences with match to scaffold
   107,063
   Number of SwissProt protein sequences with match to scaffold 42,351
   Variant identification results
   Number of Scaffolds with at least one variant site 3,651,235
   Total number of variants 17,226,908
   Variant annotation results
   Total number of variants with annotations 1,347,070
   Number of variants annotated with Decapoda proteins 1,339,785
   Number of variants annotated with SwissProt proteins 142,710
   [155]Open in a new tab

   Among the 17,226,908 variant sites, 10,212,187 SNP loci were retained
   after removing indels, MNPs and SNP sites with less than 5% minor
   allele frequency ([156]Table 3). After all of the filtering steps were
   complete, a total of 194,259 individual SNP loci remained in the
   dataset. Out of the 194,259 polymorphic SNPs loci, 4983 SNP loci were
   identified as outliers and putatively under positive selection by the
   pcadapt approach ([157]Table 3, [158]Figure 2).

Table 3.

   Number of single nucleotide polymorphism (SNP) loci remained after each
   quality filtering steps for dataset using all sampling groups.
   Filtering Steps Number of Loci
   Total number of raw variants loci 17,226,908
   Remaining SNP loci after excluded indels and MNP sites 13,530,393
   Remaining SNP loci after excluded sites with MAF < 0.05 10,212,187
   Remaining SNP loci after excluded sites with missing genotypes in >80%
   of the samples in any population 417,048
   Remaining SNP loci after excluded sites with genotypes not in
   Hardy-Weinberg equilibrium in any population (PHWE < 0.001) 328,028
   SNP loci remained after all quality filtering 194,259
   Number of putatively adaptive SNP loci 4983
   [159]Open in a new tab

Figure 2.

   [160]Figure 2
   [161]Open in a new tab

   Separation of the putatively adaptive panels of single nucleotide
   polymorphisms (SNPs) loci, based on the pcadapt approaches. Among the
   194,259 SNP loci, the pcadapt approach detected 4983 SNPs, as putative
   adaptive loci (above the dotted line).

3.2. Power Analysis

   In all of the SNP loci dataset, power was somewhat dependent on the
   presumed effective population size and time since separation,
   fluctuating from around 0.824 to 1 ([162]Table 4). Nevertheless,
   putatively adaptive SNP loci dataset provided comparatively higher
   power than the all the SNP loci dataset, varying from 0.985 to 1
   ([163]Table 4). However, both the SNP loci datasets provided more than
   adequate power to detect genetic difference among the five broodstock
   populations of P. monodon.

Table 4.

   Results of the power analysis conducted on all 194,259 SNP loci and
   4983 putatively adaptive SNP loci.
    All 194,259 SNP Loci  4983 Putatively Adaptive SNP Loci
   t  Ne       Power      t  Ne             Power
   10 1000     0.945      10 1000           1.000
   20 1000     1.000      20 1000           1.000
   10 2000     0.884      10 2000           1.000
   20 2000     0.912      20 2000           0.994
   10 3000     0.824      10 3000           0.996
   20 3000     0.902      20 3000           0.985
   [164]Open in a new tab

3.3. Demographic Interpretations from F[ST] Statistics and AMOVA Analysis

   The pairwise F[ST] value of five broodstock populations of P. monodon
   for the all SNP loci dataset were markedly lower than the pairwise
   F[ST]estimates for the putatively adaptive SNP loci dataset ([165]Table
   5). For all the 194,259 SNP loci, pairwise F[ST] estimates ranged from
   0.003 to 0.153, with an overall average value of 0.089 ([166]Table 5).
   For the 4983 putatively adaptive SNP loci, the pairwise F[ST] value
   displayed higher values, ranging from 0.003 to 0.870, with an overall
   average value of 0.789. Low genetic differentiation, as observed by low
   pairwise F[ST] value, was reported between the two wild-caught (MJ vs.
   MS) broodstock populations of P. monodon for all SNP loci (F[ST] =
   0.008; p = 0.001) and putatively adaptive SNP loci (F[ST] = 0.106; p =
   0.000) datasets. Similar pattern of genetic differentiation was also
   observed between the three domesticated (MMD vs. MMO; MT vs MMO; MT vs.
   MMD) broodstock populations of P. monodon, with pairwise F[ST] values
   ranging from 0.003 (MMD vs. MMO) to 0.010 (MMO vs. MT) in all the SNP
   loci dataset, and 0.003 (MMD vs. MMO) to 0.042 (MMD vs. MT) in the
   putatively adaptive SNP loci dataset. In contrast, a high genetic
   divergence was observed between domesticated and wild populations, with
   pairwise F[ST] values ranging from 0.145 (MMD vs. MS) to 0.153 (MJ vs.
   MT) in all the SNP loci dataset, and 0.836 (MJ vs. MMO) to 0.870 (MMD
   vs. MS) in the putatively adaptive SNP loci dataset. All of the
   pairwise F[ST] comparison were significant (p ≤ 0.001), except for
   those between MMD and MMO for putatively adaptive SNP loci (p = 0.105)
   ([167]Table 5).

Table 5.

   Pairwise F[ST] values (below diagonal) and associated p values (above
   diagonal) for the all SNPs loci and putatively adaptive SNP loci of
   giant tiger shrimp Penaeus monodon wild and domesticated broodstock
   populations collected from different Indo-Pacific regions.
          Wild       Domesticated
        MJ    MS    MMO   MMD   MT
             All SNP loci
   MJ   --   0.001 0.001 0.001 0.001
   MS  0.008  --   0.001 0.001 0.001
   MMO 0.151 0.147  --   0.023 0.001
   MMD 0.150 0.145 0.003  --   0.001
   MT  0.153 0.148 0.010 0.008  --
     Putatively adaptive SNP loci
   MJ   --   0.000 0.000 0.000 0.000
   MS  0.106  --   0.000 0.000 0.000
   MMO 0.836 0.856  --   0.105 0.002
   MMD 0.853 0.870 0.003  --   0.000
   MT  0.850 0.868 0.035 0.042  --
   [168]Open in a new tab

   When all SNPs were used, genetic variation inferred by a hierarchical
   AMOVA reveals the largest component of genetic variability (74.1%)
   within the individual level ([169]Table 6). The portion of genetic
   divergence captured by AMOVA among the individuals was 16.1% (p =
   0.000). A substantially low (9.8%) but significant divergence (p =
   0.000) was observed among five broodstock populations of P. monodon
   ([170]Table 6). Overall genetic structure yielded by AMOVA using all
   SNP loci resulted in a significant F[ST] value of 0.098 (p = 0.000).
   Interestingly, hierarchical AMOVA based on the putatively adaptive loci
   revealed a high genetic divergence among five broodstock populations of
   P. monodon (80.8%; p = 0.000), but no differentiation was observed
   among individuals (0.0%; p = 0.483), while the remaining variation
   within individuals was 19.2% ([171]Table 6). The hierarchical AMOVA for
   the putatively adaptive SNP loci also resulted in a significant F[ST]
   value of 0.808 (P = 0.000), indicating a high level of genetic
   divergence of P. monodon broodstock populations.

Table 6.

   Molecular analysis of variance for the all SNP loci and putatively
   adaptive SNP loci of five giant tiger shrimp Penaeus monodon broodstock
   populations collected from different Indo-Pacific regions.
   Source of Variation Sum of Square Variance Components % of Variation
   Statistics p-Value
   All SNP loci
   Within individuals 1,056,256 25,309.931 74.1 F_it = 0.259 -
   Among individuals 1,286,226.1 5502.952 16.1 F_is = 0.179 0.000
   Among populations 355,088.8 3342.094 9.8 F_st = 0.098 0.000
   Putatively adaptive SNP loci
   Within individuals 10,205.5 238.184 19.2 F_it = 0.808 -
   Among individuals 8192.9 −0.083 0.0 F_is = 0.000 0.483
   Among populations 63029.16 1004.217 80.8 F_st = 0.808 0.000
   [172]Open in a new tab

3.4. Genetic Structure Based on Clustering Analyses

   Inferring from several spatial clustering analyses of all SNP loci and
   putatively adaptive SNP loci datasets, substantial differences in
   genetic structure patterns of five broodstock populations of P. monodon
   were observed. A discriminant analysis of principal components (DAPC)
   of all SNP loci datasets revealed that MT and MS populations were
   pronouncedly overlapped among each other, while others three
   populations were more or less discriminated from each other’s
   ([173]Figure 3A). For putatively adaptive SNP loci dataset, DAPC
   analysis revealed that wild-caught broodstock populations (MS and MJ)
   were closely associated with each other’s ([174]Figure 3B). The
   clustering patterns of all and putatively adaptive SNP loci datasets
   were also analyzed by neighbor-joining (NJ) trees based on Nei’s
   genetic distances, both of which revealed a divergent genetic structure
   of five broodstock populations of P. monodon into two clusters: (1)
   wild-caught populations (MS and MJ), and (2) domesticated populations
   (MMD, MMO and MT). However, NJ tree topology for all SNP loci dataset
   showed lower Nei’s genetic distance compared to the putatively adaptive
   SNP loci dataset, and only one wild-caught individual from Japan (MJ6)
   incongruently clustered with domesticated broodstock populations of P.
   monodon ([175]Figure 4). Like NJ trees, STRUCTURE program also detects
   distinct genetic clusters (K) within a set of populations using a
   Bayesian clustering model. The number of clusters that best explain the
   genetic variation in the dataset is determined by estimating the
   cross-validation error. Similar to NJ trees, admixture analysis using
   Bayesian STRUCTURE of both all SNP loci and putatively adaptive SNP
   loci datasets also revealed two distinct genetic clusters (K = 2)
   within five broodstock populations of P. monodon ([176]Figure 5). One
   spatial clustering was observed for wild-caught populations collected
   from MS and MJ regions, while another spatial clustering was observed
   for the domesticated populations collected from the MMD, MMO and MT
   regions ([177]Figure 5).

Figure 3.

   [178]Figure 3
   [179]Open in a new tab

   Plots showing the discriminant analysis of principal components (DAPCs)
   of genetic differentiation for the all (A) and the putatively adaptive
   (B) SNP loci of five broodstock populations of Penaeus monodon. Ovals
   are the inertial ellipse, dot represent individual genotypes and the
   line extends to centroids of each population Here, MJ indicate samples
   from Shizuoka, Japan (wild); MMD indicates samples from Mahajamba,
   Madagascar (Domesticated); MMO indicates samples from Hawaii, HI, USA
   (Domesticated); MS indicate samples from Setiu Wetland, Malaysia
   (wild); MT indicates samples from Petchaburi, Thailand (Domesticated).

Figure 4.

   [180]Figure 4
   [181]Open in a new tab

   Neighbor-joining trees based on the Nei’s genetic distances for all SNP
   loci dataset (A) and the putatively adaptive panel of SNP loci dataset
   (B) of five broodstock populations of Penaeus monodon in the
   Indo-Pacific region. Branch nodes are denoted as the percentage of
   bootstrap support that was generated with 1000 replicates. Here, MJ
   indicate samples from Shizuoka, Japan (wild); MS indicate samples from
   Setiu Wetland, Malaysia (wild); MMD indicates samples collected from
   Mahajamba, Madagascar (Domesticated); MMO indicates samples from
   Hawaii, HI, USA (Domesticated); MT indicates samples from Petchaburi
   Province, Thailand (Domesticated). The numeric number next to the
   sample name abbreviation indicates the respective individual tag number
   during collection.

Figure 5.

   [182]Figure 5
   [183]Open in a new tab

   Bayesian STRUCTURE bar plot for all SNP loci dataset (A) and the
   putatively adaptive panel of SNP loci dataset identified by pcadapt
   approaches (B) of five broodstock populations of Penaeus monodon in the
   Indo-Pacific region. Each color represents the proportion of inferred
   ancestry from K ancestral populations and each bar represents an
   individual sample. Based on the delta K statistic, the best supported
   number of a posteriori genetic clusters was K = 2 for the standard
   admixture model. Here, MJ indicate samples from Shizuoka, Japan (wild);
   MS indicate samples from Setiu Wetland, Malaysia (wild); MMD indicates
   samples from Mahajamba, Madagascar (Domesticated); MMO indicates
   samples from Hawaii, HI, USA (Domesticated); MT indicates samples from
   Petchaburi Province, Thailand (Domesticated). The numeric number next
   to the sample name abbreviation indicates the respective individual tag
   number during collection.

3.5. GO Categorization of the Encoding Genes of Putatively Adaptive SNP Loci

   The 4983 putatively adaptive SNP loci sequences identified as outliers
   with the pcadapt approach were blasted against the genome of the
   Pacific white shrimp (L. vannamei), and 50 of them yielded significant
   matches with CDS amino acid sequences ([184]Table S3). Among the 50 CDS
   amino acids sequences, 5 (COX3, ND4, ND6, CYTB, ND1) originated from
   mitochondrion, 16 were from known characterized protein, while the
   remaining 29 were uncharacterized protein [185]Table S3. The 50 CDS
   were then subjected to GO and KEGG pathway enrichment analysis to
   discern the functional characterization of the genes encoded within the
   identified adaptive SNP loci. The GO results showed that most of the
   enriched pathways were linked to physiological functions, such as
   respiration, metabolism and energy derivation through mitochondrial
   oxidative phosphorylation and other organic compounds ([186]Figure 6,
   [187]Table S4). Many of these genes were involved in respiratory
   electron transport chain, respiratory chain complex, cellular
   respiration, aerobic respiration, respirasome, oxydoreductase activity,
   oxydative phosphorylation, organelle envelop, mitochondrion,
   mitochondrial envelope, membrane protein complex, generation of
   precursor metabolites and energy, energy derivation by oxidation of
   organic compounds, catalytic complex and the ATP metabolic process.
   KEGG pathway enrichment further demonstrated the involvement of the
   encoded genes in pathways related to metabolism, oxidative
   phosphorylation, cardiac muscle contraction and endocytosis, which is
   in coordination with the results of GO enrichment ([188]Figure 7,
   [189]Table S5). In addition, this study reveals an array of
   uncharacterized proteins with unknown functions. One uncharacterized
   protein LOC113820286 also appeared to be involved in respiration, the
   mitochondrial protein complex, the generation of precursor metabolites
   and energy and electron transport and ATP synthesis, like known mt
   proteins ND4, ND1, CYTB and COX3 ([190]Table S4).

Figure 6.

   [191]Figure 6
   [192]Open in a new tab

   Gene ontology (GO) pathway enrichment analysis for the 50 genes encoded
   by the putatively adaptive SNP loci in P. monodon population.

Figure 7.

   [193]Figure 7
   [194]Open in a new tab

   Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways enrichment
   analysis for the 50 genes encoded by the putatively adaptive SNP loci
   in P. monodon population.

4. Discussion

   Revealing the population structure patterns of P. monodon broodstocks
   is important for the systematic monitoring and management of both
   natural and wild populations. This study applied SkimSeq-based data to
   explore the variation of P. monodon broodstocks collected from natural
   habitats and shrimp farms in the Indo-Pacific regions. Loci with a high
   resolving power and potentially under selective process were detected.
   We observed a remarkable divergence between the domesticated and wild
   broodstocks. Although similar genetic discriminations were observed,
   putatively adaptive SNP loci most powerfully detected the genetic
   discrimination, as revealed by demographic interpretations and inferred
   clustering analyses. Compared to conventional molecular markers used
   for population genetic studies, our SkimSeq approach uses unbiased
   whole genome sequences to accurately identify the traces of selection
   that cause genetic differentiation using a lower coverage area with
   combination of small population size [[195]77,[196]78]. We noted that
   the impact of random genetic drift is larger in smaller sample size
   [[197]79]. However, empirical studies have denoted that high throughput
   DNA sequencing have compensated smaller sample size with large number
   of generated SNP loci, to ensure high accuracy in estimating population
   genetic parameters [[198]80,[199]81,[200]82]. Sample size as low as
   four individuals have been documented to be efficient in providing a
   precise estimate of FST values [[201]83,[202]84]. Indeed, a universal
   sample size rule may not be feasible to address the complexities in
   genomic kinship estimates [[203]85]. As such, with the supporting power
   analysis output, the number of samples used in the present study is
   optimal in generating datasets of high precision.

   AMOVA of putatively adaptive SNP loci showed a high total variance of
   80.8% attributed to differences between the wild and domesticated
   groups. The low pairwise F[ST] between wild populations of MS and MJ
   indicate that the two broodstock populations have genetic affinity,
   which could be the result of reciprocal transport, despite showing no
   significant geographic proximity, and this genetic pattern can be seen
   in other species, such as giant freshwater prawn (Macrobrachium
   rosenbergii) [[204]86,[205]87]. The NJ tree ([206]Figure 4) and
   Bayesian STRUCTURE analysis ([207]Figure 5) suggest limited genetic
   differentiation between all cultured populations (MMO, MMD and MT),
   implying that they might be derived from similar sources prior to
   domestication. Significant genetic similarity between the cultured
   populations may have stemmed from similar founder populations and
   selection procedures, which are normally practiced in shrimp industry.

   Wild populations (MS, MJ) co-locate on another branch in NJ tree,
   indicating they have a separate origin from the cultured populations
   [[208]86]. Besides, pairwise F[ST] estimates, AMOVA, DAPC plot and
   STRUCTURE analysis agree with the genetic homogeneity of domesticated
   populations and their significant differentiation from the wild
   progenitor populations [[209]88,[210]89], particularly more significant
   for the putatively adaptive dataset. Significant genetic
   differentiation between domesticated stocks and wild populations has
   not been commonly observed in studies of the same species [[211]37],
   but in other species such as salmon (Salmo salar) [[212]90], grass carp
   (Ctenopharyngodon idella) [[213]91,[214]92] and Asian seabass (Lates
   calcarifer) [[215]93].

   In the present study, only one individual from MJ population was
   genetically identical to the cultured populations. Wild broodstocks,
   obtained as founders for shrimp domestication program, have been
   subjected to mass selection process over many generations, which may
   render lower genetic affinity between the cultured and wild
   populations. We also hypothesize that the founder broodstocks for the
   domestication program may have originated from more than one shrimp
   breeding company, and wild populations of different geographical
   localities. A lack of significant differentiation among domesticated
   populations may suggest the probability of a relatively short
   domestication history or genetically closely related populations
   [[216]94].

   Lack of gene flow between domesticated and wild populations is not
   unexpected when the confined environment of shrimp hatcheries and farms
   were taken into consideration. Moreover, our experimental design have
   defiled the potential occurrence of escapees from farms to the natural
   environment, given that all individuals were only obtained from
   populations, which were isolated completely from each other with a
   minimum of 500 km apart (MT and MS). The genetic structuring pattern
   being impacted by farm escapees was not uncommon, and has been reported
   in P. monodon [[217]33,[218]95]. However, more shrimp samples of wild
   and cultured stocks from different regions need to be analyzed to
   validate this genetic distinction. In fact, previous studies have
   suggested that fragmentation was commonly observed within the penaeid
   shrimp populations that were geographically separated by smaller
   distance [[219]96].

   The genetic homogeneity between the domesticated populations may be due
   to the artificial selection of favorable traits or adaptation to
   similar aquaculture practices. Aquaculture practices have been observed
   to reduce genetic variability in farmed reared stocks of other aquatic
   species [[220]97,[221]98,[222]99]. Despite being geographically
   isolated, the genetic similarity among the wild populations, could
   possibly be linked to adaptive fitness to similar environmental
   conditions [[223]100,[224]101]. Convergent evolution has been
   documented in abalone (Haliotis midae) [[225]102] and scallops
   (Pectinidae) [[226]103], inhabiting analogous ecological niches, which
   subsequently develop consonant phenotypic traits. In addition, early
   population genetic studies on penaeid shrimp based on various
   techniques, including allozyme, RAPD and mtDNA analyses, showed that
   small genetic differences in these species were attributed to the
   dispersal ability, life history of shrimp and lack of physical barriers
   in the marine environment [[227]32,[228]95,[229]104,[230]105].

   The effect of random genetic drift is more accentuated in smaller
   sample size [[231]106]. Unequal sex ratio or differential reproductive
   contributions of the broodstocks in most breeding programs may cause
   random genetic drift [[232]94,[233]107]. In the present study, the low
   genetic differences between domesticated populations might be partially
   caused by random genetic drift or regionally different selective
   regimes [[234]108]. Although this factor is a determinant key in
   dramatic depression of genetic variability in domesticated populations,
   it is beyond the scope of this study.

   Unique genes also persists in differentiating the domesticated and wild
   populations, which can be explained by selection of differentially
   favored alleles, holding particular reference to genetic improvement
   program in captive breeding environment [[235]89]. The evidence of
   differential selection between the two groups ([236]Figure 4 and
   [237]Figure 5) highlighting the role of selection as a major
   evolutionary force in driving genetic divergence, specifically in
   domesticated populations. Selective pressures may be in part
   responsible for facilitating population variation; considering that
   these domesticated stocks have been undergoing grading procedures where
   undesired specimens were culled in the entire production system
   [[238]89]. On the other hand, population heterogeneity connected to
   adaptation to environmental factors or ecological niches is well
   revealed in wild populations [[239]102,[240]109,[241]110]. The
   development of ecotypes is well documented for many aquatic species in
   various environments where environmental clines persist
   [[242]111,[243]112,[244]113].

   The putatively adaptive SNP loci identified by the pcadapt approach
   also identifies genomic regions associated with the strong artificial
   selection during the domestication process over temporal scales. Of the
   4983 putatively adaptive SNPs, only 50 genes were encoded successfully
   annotated through BLAST analysis. We have an array of uncharacterized
   proteins, with their functions are not known. Moreover, it was also
   observed that five mitochondrial genes (COX3, ND4, ND6, CYTB, ND1) were
   mutated among different populations of P. monodon. mtDNA has been
   widely used in population genetic studies to reconstruct phylogenetic
   relationships and analyze population structure [[245]114], due to its
   unique characteristics such as maternal inheritance, neutrality, higher
   mutational rate than nuclear DNA and little to no recombination
   [[246]115,[247]116,[248]117]. However, the prominent functions of a set
   of proteins encoded by mitochondrial genome in cellular energy
   production question its utility as a neutral marker [[249]118]. In
   fact, our results from the KEGG pathway enrichment analysis has also
   suggested the presence of mutations under positive selection for the
   peptides, in which their functional properties are highly related in
   metabolic efficiency. Mitochondria produce 95% of cellular energy
   through oxidative phosphorylation of ADP (adenosine diphosphate) to
   form ATP (adenosine triphosphate). There is evidence that shows that
   several proteins containing mitochondrial encoded amino acids are
   involved not only in ion translocation, but also in mitochondrial
   respiration regulation [[250]119,[251]120]. Variation in mtDNA maybe
   taxon specific [[252]121], and is linked to a range of environmental
   conditions affecting metabolic processes [[253]122], development rates
   [[254]123], biological ageing [[255]124] and fitness. Mutations in the
   mitochondrial genome may affect the ability of mitochondria in ATP
   production [[256]125], and putative maternal effects of mitochondrial
   genome on fish growth rate has been observed in other species, such as
   the Atlantic salmon [[257]126]. Natural selection has also favored
   co-adaptative functions between mtDNA and nuclear DNA in maintaining
   optimal metabolic function and, thus, shaping the evolution of a
   species populations [[258]127,[259]128]. Despite being extensively
   utilized in manifold population structure studies, our findings have
   discovered the core functional properties of mitochondrial genome,
   particularly those involved in metabolic and energy productions as the
   driving force of population divergence. These fine attributes of
   mitochondrial functional properties warrant further new exploration and
   continual usage as one of the parameters in population genomics studies
   of various organisms. It has also been noticed that several genes are
   likely linked to other biological functions like molting (cuticle
   protein 21-like) [[260]129], developmental process (neurotrophin
   1-like) [[261]130] and immunity (baculoviral IAP repeat-containing
   protein, serine protease 42-like) [[262]131,[263]132]. However, the
   present findings demonstrated that the nonsynonymous mutations in these
   encoded genes might be associated with different biological functions,
   which may enable P. monodon broodstocks to adapt to strong artificial
   selection during domestication process.

5. Conclusions

   We observed a pronounced genetic divergence between the wild and
   domesticated broodstock populations of P. monodon across the
   Indo-Pacific region. We suggested that similar founder populations,
   artificial selection regimes of desired commercial traits and local
   adaptive process to similar aquaculture practices have dramatically
   reduced the genetic heterogeneity of the domesticated stocks. Despite
   being geographically isolated, we denote that the patterns of genetic
   homogeneity in wild populations maybe strongly influenced by
   hydrographic conditions and ecological niches which are expected to
   increase the adaptive processes of the populations towards their
   natural habitats. Genetic differences between wild and domesticated
   populations presented in this study can be likely explained by a number
   of unique genes within the putatively adaptive SNP loci. These genes
   were found to be significantly associated with the mitochondrial
   genome. Our results from KEGG pathway analysis further reinforce the
   potential selective force which have diverged both groups. We noted the
   presence of polymorphism in the mitochondrial regions, notably those
   that are related to the energy production, metabolic functions,
   respiration regulation and developmental rates. The combination of
   these functional properties of peptide encoded by mitogenome are linked
   to various environmental parameters, and this likely acting to promote
   the genetic isolation of domesticated populations from their wild.
   Separated by geographical distance and various selection breeding
   programs, these two groups may each develop local adaptation to
   ecological niches, biological traits and demographic histories found
   within their thriving habitats. Taken together, this study has
   demonstrated the applicability of SkimSeq in a highly duplicated genome
   of penaeid shrimp, P. monodon specifically, across a range of genetic
   backgrounds and geographical distributions. This ultra-low sequencing
   coverage has enabled the sequencing of low number of individuals to
   achieve either similar or perhaps more powerful coverage, but with
   lower cost than other traditional GBS methods. In future work, larger
   sample sizes from a larger number of populations collected from wider
   distribution range should be included within a similar comparative
   framework, to validate if this specific trend represents a general rule
   associated with the genetic divergence of P. monodon broodstocks.

Acknowledgments