Abstract

   Brown trout (Salmo trutta), like many other freshwater species, is
   threatened by the release of alien species into its natural environment
   and by restocking with allochthonous conspecific stocks. Many
   conservation projects are ongoing, and several morphological and genetic
   tools have been proposed to support activities aimed at restoring the
   genetic integrity of native populations. Nevertheless, owing to the
   complex degree of introgression reached after many generations of
   crossing, dichotomous keys and molecular markers such as mtDNA,
   LDH-C1* and microsatellites are often not sufficient to discriminate
   native from admixed specimens at the individual level. Here we propose a
   reduced panel of ancestry-informative SNP markers (AIMs) to support
   on-field activities for Mediterranean trout management and conservation.
   Starting from genotype data obtained on specimens sampled in the two
   main rivers of Molise (Central-Southern Italy), a panel of 47 AIMs was
   identified and validated on simulated and real hybrid population
   datasets, mainly through a Machine Learning approach based on a Random
   Forest classifier. The proposed AIM panel may represent a
   cost-effective tool for monitoring the level of introgression between
   native and allochthonous trout populations for conservation purposes,
   and the methodology could also be applied to other species.

   Keywords: Mediterranean trout, introgression, SNP array, machine
   learning, random forest, ancestry informative markers

1. Introduction

   The introduction of alien species is causing dramatic changes in many
   freshwater ecosystems worldwide, eroding the integrity of local
   species [[32]1,[33]2]. Among the affected species, brown trout
   (S. trutta) has traditionally attracted the attention of conservation
   biologists and public institutions due to its iconic significance for
   fishery management and aquaculture. Brown trout is considered a complex
   of incipient species [[34]3,[35]4], which comprises several phylogenetic
   lineages including Mediterranean trout [[36]5]. Recently, many
   conservation projects have been proposed to restore the genetic
   integrity of Mediterranean trout populations in Italy
   [[37]5,[38]6,[39]7,[40]8,[41]9,[42]10], which are mainly threatened by
   introgressive hybridization between native and commercial hatchery
   strains, often introduced to meet fishing demands and for
   recreational activities. Morphological features (such as number of parr
   marks, adipose fin color pattern and number of black opercular spots
   [[43]11]) have been successfully adopted to quickly
   differentiate among native, farm-reared and hybrid specimens during
   the preliminary steps of monitoring activities [[44]11], but the use of
   genetic markers remains pivotal for conducting a truly effective
   restoration project [[45]12].

   At the genetic level, Mediterranean trout can be easily
   discriminated from the Atlantic lineage through PCR-RFLP analysis of
   mtDNA segments in combination with the nuclear LDH-C1* locus. Such loci
   have been extensively used in order to provide a rapid genetic
   characterization in many studies [[46]8,[47]9,[48]13], sometimes in
   conjunction with the use of microsatellite markers to better assess the
   effectiveness of specific conservation objectives [[49]14].
   Nevertheless, microsatellites suffer from a lack of replicability
   among laboratories and do not guarantee broad genome coverage [[50]15],
   which is pivotal for the study of genetic structure at a fine scale. In
   this regard, recent advances in genomic resources and technologies
   represent key opportunities to overcome these issues and optimize
   conservation efforts in many wild species, including brown trout
   [[51]16,[52]17,[53]18]. In particular, medium-density SNP arrays are
   now available for some salmonid species, such as rainbow trout and
   Atlantic salmon [[54]19], and a large SNP array was recently developed
   for brown trout as well [[55]15]. Nevertheless, the S. trutta complex is
   one of the most genetically diverse vertebrate groups, consisting of
   more than 60 species, including several ecotypes and evolutionary
   lineages [[56]20,[57]21]. In this context, the development of a
   species-specific array is still unrealistic; thus, the use of large SNP
   microarrays developed in one species to analyze a closely related
   species with limited genetic resources can be considered as an
   effective alternative [[58]22]. In a recent work, Palombo et al.
   [[59]6] successfully used, for the first time, the rainbow trout-derived
   57K SNP array [[60]23] for the genetic characterization of two
   Mediterranean trout populations inhabiting Molise rivers
   (Central-Southern Italy). The authors reported useful information for
   fine-scale characterization of genetic structure, and their results
   supported the conservation and monitoring activities implemented by the
   LIFE17 NAT/IT/000547 Project.

   In order to provide a more affordable and cost-effective solution to
   further support native trout conservation and management activities, we
   decided to exploit the genetic information obtained in that previous
   study [[61]6] to create a reduced SNP panel containing the most
   ancestry-informative markers (AIMs) with very little loss of
   information compared to the initial Axiom 57K array genotyping solution.
   Several studies have shown that a reduced set of selected informative
   markers can effectively capture the genetic structure of populations in
   humans and livestock [[62]24,[63]25,[64]26], and several statistical
   analyses [[65]25,[66]26,[67]27,[68]28], as well as commercial tools
   [[69]29], are reported to be helpful to this aim. Among these, we
   decided to apply a Machine Learning (ML) approach, in line with what
   has been recently proposed by other authors [[70]25,[71]26,[72]30],
   who used a Random Forest (RF) classifier to identify
   population-informative SNPs useful for pig, cattle, and wild sheep breed
   identification.

2. Materials and Methods

2.1. Filtering Procedures and Reference Dataset Building

   In total, 288 specimens from the Biferno and Volturno rivers, the two
   main basins of Molise, were included in this study. As previously
   described [[73]6], those samples were collected within the LIFE17
   NAT/IT/000547 Project’s activities and were genotyped with the rainbow
   trout Axiom SNP array [[74]23], as well as screened by PCR-RFLP at the
   16S rRNA and LDH-C1* loci, according to McMeel et al. [[75]31] and
   Chiesa et al. [[76]32]. Within the Project’s conservation activities,
   the combination of genotypes at the 16S rRNA and LDH-C1* loci has been
   used as the criterion for identifying six different classes of
   introgression (i.e., from class I ‘completely introgressed’ to class VI
   ‘no introgressed’), according to Pensierini et al. [[77]33]. In
   particular, specimens of class VI were declared native and thus
   used for reproduction during the Project. Traditionally,
   different combinations of mitochondrial and nuclear genetic loci have
   been used for genetic analysis of S. trutta
   [[78]10,[79]34,[80]35,[81]36].

   An initial quality control (QC) of the dataset was performed by first
   applying a filter based on individual genotyping success and retaining
   profiles with a ≥80% success rate. Next, data were screened with a
   per-SNP genotyping threshold of ≥95% and pruned for loci deviating from
   Hardy–Weinberg equilibrium (HWE; p-value ≤ 10^−3). SNPs that met the QC
   criteria were then used to perform a preliminary assessment of the
   entire dataset in Admixture software (v.1.3.0) [[82]37] with K = 2
   population clusters, roughly corresponding to the native and alien trout
   populations inhabiting Molise rivers, according to the results reported
   by Palombo et al. [[83]6]. Admixture analysis was performed considering
   each river separately to exclude possible hybrids. Non-admixed
   specimens, assigned with an admixture ancestry score (q[i]) ≥ 0.99 to
   their respective population clusters (i.e., native and alien
   populations), were retained to construct a reference dataset for AIMs
   identification.
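
   The filtering steps described above can be sketched in Python as
   follows. This is only an illustration under stated assumptions, not the
   pipeline used in the study: genotypes are assumed to be coded as 0/1/2
   copies of one allele with None for missing calls, the HWE test is a
   simple chi-square approximation (the text does not state which test was
   applied), ancestry scores are assumed to come from a separate Admixture
   run, and all function names are hypothetical.

```python
import math

MISSING = None  # assumed code for a missing genotype call


def call_rate(calls):
    """Fraction of non-missing genotype calls in a list."""
    return sum(g is not MISSING for g in calls) / len(calls)


def hwe_pvalue(calls):
    """Chi-square (1 d.f.) HWE test for one bi-allelic SNP coded 0/1/2."""
    obs = [sum(g == k for g in calls) for k in (0, 1, 2)]
    n = sum(obs)
    if n == 0:
        return 1.0
    p = (2 * obs[2] + obs[1]) / (2 * n)  # frequency of the counted allele
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic locus: no deviation can be tested
    exp = [n * q * q, 2 * n * p * q, n * p * p]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    # Survival function of a chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


def quality_control(genotypes, sample_min=0.80, snp_min=0.95, hwe_alpha=1e-3):
    """genotypes: dict sample_id -> list of calls, one per SNP.

    Returns the retained sample ids and the retained SNP indices.
    """
    samples = [s for s, c in genotypes.items() if call_rate(c) >= sample_min]
    n_snps = len(next(iter(genotypes.values())))
    kept_snps = []
    for j in range(n_snps):
        column = [genotypes[s][j] for s in samples]
        if call_rate(column) < snp_min:
            continue  # per-SNP genotyping rate too low
        if hwe_pvalue(column) <= hwe_alpha:
            continue  # deviates from HWE
        kept_snps.append(j)
    return samples, kept_snps


def reference_set(q_native, threshold=0.99):
    """Split samples into non-admixed native and alien groups.

    q_native: dict sample_id -> ancestry score for the native cluster (K = 2).
    """
    native = [s for s, q in q_native.items() if q >= threshold]
    alien = [s for s, q in q_native.items() if 1 - q >= threshold]
    return native, alien
```

   Samples with intermediate scores fall into neither list and would
   correspond to the putative hybrids excluded from the reference dataset.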

2.2. Marker Selection

   Four different statistical methods were employed on the dataset to
   estimate marker information content. Based on the results obtained and
   considering that the most common microplates have a 96-well format,
   the top 96 ranked SNPs from each approach were declared AIMs.
   Specifically, the first method, which has been one of the most popular
   for selecting informative loci, was the pairwise F[ST] estimation of
   Weir and Cockerham [[84]38] as calculated at each locus using PLINK
   software [[85]39]. The second method relied on allele-frequency
   differential (Delta) estimation [[86]40], which is one of the most
   straightforward ways to evaluate the information content of a SNP. In
   particular, for a bi-allelic marker, the Delta value is estimated as
   |pA[i] − pA[j]|, where pA[i] and pA[j] are the frequencies of allele A
   in the ith and jth subpopulation. The Delta value for each SNP locus
   was estimated as the mean across all pair-wise comparisons. The third
   selection method was principal component analysis (PCA). Informative
   markers were selected, according to Paschou et al. [[87]41],
   considering the sum of the squares of the most informative principal
   components (PC). The choice of the number of PC was determined by the
   amount of variance explained, as previously defined by Schiavo et al.
   [[88]26], and the number of PCs chosen in our analysis was three. The
   loadings for each SNP were squared and summed over the most
   significant PCs to produce an estimate of informativeness, which was
   then used to rank SNPs. The fourth selection method was the RF
   classifier, which is a supervised ML algorithm based on an ensemble of
   decision trees. RF was implemented using the scikit-learn Python
   library [[89]42]. The RF algorithm measures the importance of each
   feature (i.e., a SNP) in the classification, which can be used as an
   indicator of SNP informativeness. SNPs were ranked using two different
   split criteria: the Gini impurity (GI) and the
   Entropy (EN). Since different runs of the RF procedure can lead to
   slightly different results in terms of selected SNPs, 100 RF runs were
   performed. Finally, informative SNPs were selected based on two
   different procedures, according to Schiavo et al. [[90]26]: (1) SNPs
   that occurred more frequently among the first top 96 SNPs list, after
   the 100 runs; (2) SNPs with the highest importance average value over
   the 100 runs. These two methods were applied to evaluate the stability
   of RF selection and led to two different candidate SNP panels for
   each RF criterion applied. Overall, seven panels of 96 SNPs were
   obtained through the different statistical approaches used for SNP
   prioritization. Finally, SNPs shared among all analyses were
   considered the best candidate AIMs and, in turn, were tested in
   a separate RF analysis.
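
   The two RF stability procedures described above (occurrence among the
   top-ranked SNPs and mean importance over repeated runs) can be sketched
   with scikit-learn as follows. This is a hedged illustration rather than
   the authors' script: the function name, the number of trees and the use
   of the run index as random seed are assumptions, and the importance
   measure is scikit-learn's impurity-based `feature_importances_` computed
   under the chosen criterion ("gini" or "entropy").

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_snps_by_rf(X, y, n_runs=100, top_k=96, criterion="gini", n_trees=500):
    """Rank SNPs by RF importance over repeated runs.

    X: (samples x SNPs) genotype matrix coded 0/1/2; y: population labels.
    Returns two lists of SNP column indices:
      (1) SNPs occurring most often in the per-run top_k list,
      (2) SNPs with the highest mean importance across runs.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    top_counts = Counter()
    importance_sum = np.zeros(X.shape[1])
    for run in range(n_runs):
        forest = RandomForestClassifier(
            n_estimators=n_trees, criterion=criterion, random_state=run
        )
        forest.fit(X, y)
        importances = forest.feature_importances_
        importance_sum += importances
        # Indices of the top_k most important SNPs in this run
        top_counts.update(np.argsort(importances)[::-1][:top_k].tolist())
    by_occurrence = [snp for snp, _ in top_counts.most_common(top_k)]
    by_mean = np.argsort(importance_sum / n_runs)[::-1][:top_k].tolist()
    return by_occurrence, by_mean
```

   Calling the function once with `criterion="gini"` and once with
   `criterion="entropy"` would yield four candidate lists, mirroring the
   four RF panels described in the text.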

2.3. Validation Dataset

   In order to test and validate the AIM panels identified through marker
   selection analyses, an RF classifier was applied to genotype data
   from simulated hybrid populations. Simulated hybrids were artificially
   constructed using the Hybridiser v0.1 R script developed by Somenzi et
   al. [[91]30]. In more detail, a dataset of 60 simulated hybrids was
   generated for both rivers as follows: (i) 20 F1 offspring obtained from
   native × alien specimens, (ii) 20 backcrosses between F1 × alien
   specimens (BC1A) and (iii) 20 individuals obtained as backcrosses between
   F1 and native trout (BC1N). Although, as already observed by Schiavo et
   al. [[92]26], RF does not need any cross-validation on a separate test
   set to get an unbiased estimate of the test set error, we decided to
   randomly split the validation dataset into a new reference (80% of
   specimens) and a test population (the remaining 20%). The RF classifier
   was fitted on this new reference set and the corresponding out-of-bag
   (OOB) score was calculated, which is an unbiased estimate of prediction
   accuracy. Classification performance was also assessed using the test
   population (i.e., animals not used to train the algorithm), which
   allowed us to evaluate the fitted ML model.
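
   The crossing scheme used to build the simulated hybrids (F1, BC1A and
   BC1N) can be illustrated with a small Mendelian sampling sketch. It is
   not a re-implementation of the Hybridiser script, whose internals are
   not described here: loci are assumed to be unlinked, genotypes are
   assumed to be coded as 0/1/2 copies of one allele without missing
   values, and all names are hypothetical.

```python
import random


def gamete(genotype, rng):
    """Draw one allele (0 or 1) per SNP from a diploid 0/1/2 genotype."""
    return [g // 2 if g != 1 else rng.randint(0, 1) for g in genotype]


def cross(parent_a, parent_b, rng):
    """Offspring genotype as the sum of one gamete from each parent."""
    return [a + b for a, b in zip(gamete(parent_a, rng), gamete(parent_b, rng))]


def simulate_hybrids(native, alien, n_per_class=20, seed=0):
    """Simulate F1, BC1A (F1 x alien) and BC1N (F1 x native) individuals.

    native, alien: lists of reference genotypes (lists of 0/1/2 values).
    """
    rng = random.Random(seed)
    f1 = [cross(rng.choice(native), rng.choice(alien), rng)
          for _ in range(n_per_class)]
    bc1a = [cross(rng.choice(f1), rng.choice(alien), rng)
            for _ in range(n_per_class)]
    bc1n = [cross(rng.choice(f1), rng.choice(native), rng)
            for _ in range(n_per_class)]
    return {"F1": f1, "BC1A": bc1a, "BC1N": bc1n}
```

   With the default of 20 individuals per class, one call produces the 60
   simulated hybrids per river mentioned in the text.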

   Furthermore, to visually compare the performance of the full set of
   SNPs and the candidate AIMs shared among the seven panels, PCA was
   performed considering both simulated and real hybrid populations. Real
   hybrids were extracted from the initial dataset as specimens with a q[i]
   admixture score <0.99. Finally, to measure how well the candidate AIMs
   estimated the admixture level compared to that determined by the full
   set of markers, we compared the admixture results using the coefficient
   of determination (r^2). To test whether the AIM panels performed better
   than an equally sized set of SNPs chosen at random, 1000 random SNP sets
   were generated, and for each random set a supervised admixture analysis
   was performed. The coefficient of determination between the ancestry
   assignments of the full set and of each reduced random panel was then
   computed, and the values obtained for the 1000 random SNP sets were
   standardized as z-scores.
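
   The comparison described above can be sketched as follows, assuming
   that r^2 is computed as the squared Pearson correlation between the
   ancestry proportions obtained with the full SNP set and with a reduced
   panel. The admixture runs themselves are performed by external software
   and are not reproduced; function names are hypothetical.

```python
import random
import statistics


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def random_panels(n_snps, panel_size, n_panels=1000, seed=0):
    """Draw random SNP index sets of the same size as the tested panel."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_snps), panel_size))
            for _ in range(n_panels)]


def z_scores(values):
    """Standardize a list of values (e.g., r^2 of the random panels)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]
```

   In use, each random panel would be passed to a supervised admixture
   run, `r_squared` would be applied to the resulting ancestry
   proportions against those of the full set, and the 1000 values would be
   standardized with `z_scores`.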

2.4. SNP Annotation

   To further investigate the information carried by the common AIMs
   identified across the seven reduced panels, the 35 bp flanking sequence
   on each side of the SNP, provided by the array manufacturer, was
   aligned to the S. trutta genome assembly (v. fSalTru1.1) with BLASTN,
   considering an e-value cut-off of 1 × 10^−6 and a percent
   identity threshold of the matching sequence ≥85%. Hits were used to
   infer positions on the reference genome and to annotate genes spanning a
   region of ±50 Kbp around each SNP using the Ensembl Variant Effect
   Predictor (VEP) tool (release 107) [[93]43]. In order to identify
   overrepresented terms in the KEGG and GO knowledgebases, enrichment
   analyses were performed with the PANEV package [[94]44] and the
   g:Profiler toolset [[95]45], respectively, considering only annotated
   genes.
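
   A minimal sketch of the hit-filtering step is given below. It assumes
   that BLASTN results were exported in the standard 12-column tabular
   format (outfmt 6) and that, when a probe has several hits passing the
   thresholds, the one with the highest bit score is kept; the latter rule
   and the function names are assumptions, since the text does not specify
   how multiple hits were handled.

```python
def filter_blast_hits(lines, max_evalue=1e-6, min_identity=85.0):
    """Keep, for each query, the best hit passing both thresholds.

    lines: iterable of tab-separated BLAST outfmt 6 rows with columns
    qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend,
    sstart, send, evalue, bitscore.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        sstart, send = int(fields[8]), int(fields[9])
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue or identity < min_identity:
            continue
        if query not in best or bitscore > best[query]["bitscore"]:
            best[query] = {
                "subject": subject,
                "start": min(sstart, send),  # hits on the minus strand
                "end": max(sstart, send),    # have sstart > send
                "identity": identity,
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


def annotation_window(position, flank=50_000):
    """Genomic interval of +/- flank bp around a SNP (1-based, clipped)."""
    return max(1, position - flank), position + flank
```

   The interval returned by `annotation_window` corresponds to the ±50
   Kbp region searched for genes around each mapped SNP.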

3. Results

3.1. Population Overview

   In total, 633 SNPs and 288 specimens passed QC filtering. Considering
   admixture outcomes (q[i] ≥ 0.99), 49 and 19 samples were classified as
   non-admixed native (NAT) or alien (ALI), respectively ([96]Table S1),
   and were considered as the reference population for SNP prioritization.
   The PCA plot obtained using the 633 SNPs on the entire reference
   population showed a clear separation of NAT and ALI samples in both
   rivers ([97]Figure 1). PC1 (41.18% of total variance) split NAT and ALI
   trout into two distinct clusters, whereas PC2 (5.75% of total variance)
   identified subpopulation structure among NAT trout of the Biferno and
   Volturno rivers. The introgression classes estimated through the
   combination of 16S rRNA and LDH-C1* genotyping are reported in
   [98]Table S1. The distribution of trout population ancestry scores for
   each introgression class, estimated with the combination of mtDNA and
   LDH-C1* genotypes [[99]33], is reported in [100]Figure S1 and
   suggests a heterogeneous scenario within each class. A preliminary PCA
   on the validation dataset composed of real hybrids was
   performed using the 633 SNPs and rerun on the datasets split by river
   ([101]Figure 1). PCA plots showed an admixed scenario caused by a
   significant level of hybridization, in line with observations reported
   by Palombo et al. [[102]6].
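
   As a hedged illustration of how the percentages of variance quoted for
   PC1 and PC2 arise, and of how the squared-loading score used for the
   PCA-based ranking can be derived from the same decomposition, the
   sketch below applies a singular value decomposition to a centred
   genotype matrix with NumPy. It assumes genotypes coded 0/1/2 with no
   missing values and uses unit-length, unscaled eigenvectors as loadings;
   the software and scaling actually used in the study are not stated, and
   the function name is hypothetical.

```python
import numpy as np


def pca_snp_scores(genotypes, n_components=3):
    """Return explained-variance ratios and per-SNP informativeness.

    genotypes: (samples x SNPs) matrix coded 0/1/2.
    The informativeness of a SNP is the sum of its squared loadings
    over the first n_components principal components.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # centre each SNP column
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)  # fraction of variance per PC
    loadings = vt[:n_components]         # rows are PCs, columns are SNPs
    scores = np.sum(loadings ** 2, axis=0)
    return explained, scores
```

   Sorting the SNPs by decreasing score and keeping the first 96 would
   give a PCA-based panel of the kind described in the Methods.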

Figure 1.

   PCA obtained using the full SNP set on the reference and entire
   populations, encompassing non-admixed native and alien trout samples.
   Non-admixed native samples, split by river, are shown in green and
   orange, and alien samples in red. The percentage of variance explained
   by each component is reported in brackets.

3.2. Comparison of AIMs Selection Methods and Validation

   In total, seven different reduced panels were obtained, each comprising
   the top 96 ranked SNPs selected by the four different approaches applied
   to the reference population. One panel was obtained by F[ST], one by
   Delta, one by PCA statistics and four lists were derived using the RF
   algorithm, applying the GI and EN ranking methods (as described above,
   two stability procedures were tested for each method). [105]Table S2
   reports the lists of top-96 SNPs detected by each method and included in
   the seven panels, and [106]Table 1 reports the number of AIMs shared
   between pairs of SNP panels determined with the seven different
   approaches.
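
   The pairwise counts of shared SNPs and the set of markers common to
   all panels can be obtained with simple set operations. The sketch below
   assumes each panel is available as a list of SNP identifiers keyed by
   method name; the function names are hypothetical.

```python
from itertools import combinations


def pairwise_overlap(panels):
    """Number of SNPs shared by each pair of panels.

    panels: dict mapping a method name to a list of SNP identifiers.
    """
    return {
        (a, b): len(set(panels[a]) & set(panels[b]))
        for a, b in combinations(panels, 2)
    }


def shared_markers(panels):
    """SNPs present in every panel (the candidate common AIMs)."""
    return sorted(set.intersection(*(set(p) for p in panels.values())))
```

   Applied to the seven 96-SNP lists, `pairwise_overlap` would give the
   off-diagonal counts of a table of shared SNPs, and `shared_markers`
   the markers common to all methods.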

Table 1.

   Number of SNPs shared between pairs of SNP panels determined with the
   seven different methods reported in this study (in the diagonal, the 96
   SNPs).
   Method    RF GI 1  RF GI 2  RF EN 1  RF EN 2  Delta  F[ST]  PCA
   RF GI 1      96
   RF GI 2      83       96
   RF EN 1      89       81       96
   RF EN 2      85       89       84       96
   Delta        79       77       80       80     96
   F[ST]        81       81       80       83     88     96
   PCA          52       50       53       53     56     57    96

   In order to assess the reliability of the identified panels, a
   validation step was performed by applying an RF classifier. OOB scores
   and correct prediction proportions are reported in [108]Table 2. All
   samples were correctly assigned (100%) across all methods on the
   training set (train accuracy), with an average OOB score of 88%. On the
   testing set, test accuracy values were ≥92% across all approaches. The
   highest OOB score was detected for the Delta method (91%) and the lowest
   for RF EN 1 (85%).
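
   The three figures reported for each panel (OOB score, train accuracy
   and test accuracy) can be obtained from a single fitted forest, as in
   the following scikit-learn sketch. It is an illustration under
   assumptions: the stratified 80/20 split, the number of trees and the
   function name are not taken from the study, which only states that the
   split was random.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def evaluate_panel(X, y, test_size=0.2, seed=0, n_trees=500):
    """Fit an RF on 80% of the samples and report three accuracy measures.

    X: (samples x SNPs) genotype matrix restricted to one panel.
    y: class labels (e.g., NAT, ALI, F1, BC1A, BC1N).
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    forest = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, random_state=seed
    )
    forest.fit(X_train, y_train)
    return {
        # Accuracy on samples left out of each tree's bootstrap sample
        "oob_score": forest.oob_score_,
        "train_accuracy": forest.score(X_train, y_train),
        "test_accuracy": forest.score(X_test, y_test),
    }
```

   Because fully grown trees usually fit their training samples
   perfectly, a train accuracy of 100% is expected and the OOB and test
   values are the more informative figures.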

Table 2.

   Out-of-bag (OOB) scores and classification accuracy obtained by the RF
   algorithm on the reference and test trout populations using the seven
   96-SNP panels.
   Method    OOB Score  Train Accuracy  Test Accuracy
   RF GI 1      90%          100%            95%
   RF GI 2      87%          100%            97%
   RF EN 1      85%          100%            95%
   RF EN 2      86%          100%            97%
   Delta        91%          100%            92%
   F[ST]        87%          100%            92%
   PCA          87%          100%            95%

   In total, 47 SNPs were common to all top-ranked 96-SNP lists
   and were therefore considered the best candidate AIMs for the
   development of a reduced panel ([110]Table S3). The RF classifier
   validation was also performed considering the 47 candidate AIMs.
   Performance outcomes were in line with expectations (OOB score 86%,
   train accuracy 100%, test accuracy 92%). The panel of 47 common AIMs was
   also tested to detect admixture between native and alien specimens in
   both rivers. The r^2 values were high across all panels
   encompassing 96 SNPs (r^2 ≥ 0.973; [111]Table 3), and the r^2
   calculated between the ancestry percentages obtained using the 47
   candidate AIMs and the full set of SNPs was also high, i.e., 0.955 and
   0.979 for the Biferno and Volturno rivers, respectively ([112]Table 3).

Table 3.

   Coefficient of determination values (r^2) calculated between the
   ancestry percentages using the full set of SNPs and the AIM panels in
   case study populations. N is the number of SNPs in each panel.
    SNPs Panel   N  Biferno (r^2) Volturno (r^2)
       Delta     96     0.982         0.989
      F[ST]      96     0.981         0.988
        PCA      96     0.973         0.984
      RF EN 1    96     0.985         0.985
      RF EN 2    96     0.986         0.988
      RF GI 1    96     0.985         0.985
      RF GI 2    96     0.983         0.987
   Candidate AIM 47     0.955         0.979

   Furthermore, to visually compare the performance of the full set of
   SNPs with that of the 47 common AIMs, PCA was run considering both
   simulated and real hybrid populations for each river separately.
   OOB scores and correct prediction proportions on these data
   are reported in [114]Table S4. Overall, PCA plots showed a
   clustering comparable with the PCA results obtained with the full set
   of SNPs. Indeed, PCA of real hybrids ([115]Figure 2) identified several
   individuals overlapping with the pure-ancestry native cluster, while
   the others were distributed along a gradient between NAT and ALI.

Figure 2.

   PCA and density distribution of the PC1 obtained using the common 47
   AIMs on reference populations and real hybrid dataset split by rivers.

   PCA on simulated hybrids ([118]Figure 3) discriminated the parental
   populations (NAT and ALI) at opposite sides of the graph and positioned
   the hybrid populations according to their ancestry proportions, with F1
   at the center of the plot and the two backcrosses BC1N and BC1A closer
   to NAT and ALI, respectively.

Figure 3.

   PCA and density distribution of the PC1 obtained using the common 47
   AIMs on reference populations and simulated hybrid dataset split by
   rivers.

3.3. SNP Annotation and Marked Genes

   In total, 466 out of 633 SNPs (~74%) were successfully mapped on the S.
   trutta reference genome. Within panels, SNPs per chromosome ranged from
   1 to 8. Considering all panels together, the selected SNPs showed a
   similar distribution per chromosome. The highest numbers of
   top-ranked SNPs were found on chromosomes 12, 19 and 26 ([121]Figure
   4). Focusing on the 47 common AIMs, 41 out of 47 were mapped on the S.
   trutta genome assembly and in total 143 genes were pinpointed by the VEP
   tool [[122]43], considering boundaries of 50 Kbp around each SNP
   ([123]Table 4). No KEGG or GO terms were significantly overrepresented
   in our gene list.

Figure 4.

   Distribution on the 40 trout chromosomes of the SNPs selected for the
   96 SNP panels using the four different methods described in this study
   (RF GI 1 = random forest Gini Index stability occurrence; RF GI 2 =
   random forest Gini Index stability mean; RF EN 1 = random forest
   Entropy stability occurrence; RF EN 2 = random forest Entropy stability
   mean; Delta; F[ST] = Fixation index; PCA = principal component
   analysis).

Table 4.

   List of genes pinpointed by the VEP tool within or close to (<50 Kbp)
   the common SNPs included in the panels selected by the seven different
   methods used in this study (RF GI1, RF GI2, RF EN1, RF EN2, Delta,
   F[ST] and PCA).
   SNP Chr Genomic Position (bp) Gene(s)
   AX-89926492 1 36,478,362 ENSSTUG00000034565
   AX-89933844 3 56,808,067 MTX1A, THBS3A, ENSSTUG00000008371, HJV, ITGA8,
   ENSSTUG00000009416
   AX-89957249 3 49,711,361 ENSSTUG00000029747
   AX-89933361 4 43,428,599 CBR4, SH3RF1
   AX-89954271 5 11,245,515 GFRA1, CCDC172
   AX-89955512 5 31,053,287 FUOM, ENSSTUG00000037533, ZGC:66426, ERLIN1
   AX-89964745 6 28,034,638 MRC1A, SLC39A12, CACNB2
   AX-89965418 6 53,152,780 USP9, ENSSTUG00000025824, DDX3XA, TGDS,
   GPR180, SLC5A3A, SI:CH211-132G1.7
   AX-89922103 8 11,821,548 RGS12A, MSANTD1, DTX4A
   AX-89923685 8 40,108,784 ENSSTUG00000017894, ETV6, ENSSTUG00000017923
   AX-89930404 10 9,847,082 FMNL1, ENSSTUG00000009061, GRB7
   AX-89926808 12 25,219,858 CYP2R1, PDE3B
   AX-89935881 12 68,862,681 MED13L
   AX-89941680 12 72,156,842 GPSM1B, LHX3
   AX-89943019 12 78,807,143 MPDU1A, ESRRA, KCNK4, STX5AL, EHD1B
   AX-89944919 12 68,082,269 ENSSTUG00000031577, TJP2A,
   ENSSTUG00000036079, SMC5, ZFAND5B
   AX-89966227 12 24,351,329 POLR2L, DAGLA, EXT2, SYT7B, SDHAF2, CPSF7
   AX-89937326 13 48,590,388 GRIK4
   AX-89970985 13 27,301,563 PPP3CB, UBE2D2, ENSSTUG00000035840,
   ENSSTUG00000035847, PSD2
   AX-89928338 14 30,569,008 MRPL20, ATAD3A, TMEM240B, SSU72, ORA4,
   ENSSTUG00000008474, CCNL1B, VWA1
   AX-89965056 14 22,507,646 GATA2A
   AX-89976571 14 25,409,004 FHIT
   AX-89975434 15 23,622,069 IFT46, VPS11, HYOU1, H2AX1, ZPR1
   AX-89971379 16 37,104,371 PARK7, KCNAB2A
   AX-89961240 19 43,548,455 TMEM164, AMMECR1, KIF4, MRPS12,
   ENSSTUG00000016882, ENSSTUG00000016884, FIBPB
   AX-89961754 21 24,970,403 TMEM53, TESK2, TOE1
   AX-89969654 22 13,715,689 MYO9B, S1PR4, MIR24-4, ENSSTUG00000015334,
   ENSSTUG00000015336
   AX-89957356 24 15,818,961 DOCK9
   AX-89924719 25 31,489,217 MYCLA, NT5C1AB, ENSSTUG00000048637
   AX-89935421 25 23,270,420 ODC1, UTP25, ENSSTUG00000028825, ZGC:123321,
   LAMTOR3, ENSSTUG00000028866, ATP10D
   AX-89950643 25 32,760,767 RALGAPA2
   AX-89936803 26 26,628,471 SRPRA, FAM118B, ILVBL, ENSSTUG00000048185,
   ENSSTUG00000048296, B3GAT1A
   AX-89959464 26 22,256,971 SLC47A4, SLC47A3, SLC13A5B, SERPINF2B,
   ENSSTUG00000049175, ENSSTUG00000049189, RPA1
   AX-89948079 27 20,927,232 IGSF9B, ENSSTUG00000024232, TMEM127, CIAO1,
   SNRNP200, SLC20A1A
   AX-89961304 28 42,593,898 ENSSTUG00000021816, ASXL1, PCMTD2A, MYT1A
   AX-89963552 28 24,003,249 ENSSTUG00000043167, LRRC47, CEP104
   AX-89961685 29 17,322,240 -
   AX-89965310 33 12,906,283 ENSSTUG00000029959
   AX-89927784 35 3,818,258 FNDC1, OTOFA
   AX-89958723 35 4,892,093 ENSSTUG00000020096, TRMT6, FERMT1, BMP2B
   AX-89938669 38 8,489,802 RIMS1, KCNQ5B

4. Discussion

   SNP analysis tools are increasingly popular in wildlife
   conservation projects for discriminating between pure and
   alien/hybrid specimens. However, the large-scale use of SNP arrays can
   be challenging given the limited budgets of most conservation
   projects; thus, the development of a small panel of AIMs can be
   considered an effective alternative
   [[127]25,[128]26,[129]30,[130]46,[131]47,[132]48].

   In this work, we evaluated marker selection methods and determined a
   small number of highly discriminant SNPs from the rainbow trout Axiom
   array, required to effectively and confidently assign individual
   genotypes to native and alien populations, notably within the LIFE17
   NAT/IT/000547 Project’s activities. In more detail, the Project had the
   main goal of restoring the genetic integrity of native Mediterranean
   trout populations inhabiting Molise’s rivers. Our final aim is to
   support the monitoring and conservation activities proposed by the
   Project through the development of a reduced SNP panel that guarantees
   rapid and low-cost genotyping without significantly compromising
   informativeness. Indeed, although the combination of mtDNA and LDH-C1*
   genotypes can be a useful approach to suggest the degree of
   introgression at the population level, its consistency at the individual
   level is far from accurate, especially after several generations. Our
   results corroborate this general consideration: the classes of
   introgression estimated by the combination of mtDNA and LDH-C1*
   genotypes [[133]33] were not always consistent with individual ancestry
   scores estimated by admixture ([134]Figure S1), and no clear concordance
   was detectable.

   It is well known that the accuracy of AIM panels depends on the quality
   and sample size of the reference populations. Clearly, a high number of
   genotyped samples helps to capture the whole within-population
   variability and, in turn, reduces the possibility that a
   few individuals might be assigned incorrectly due to their atypical
   genotypes. Nevertheless, for many practical reasons it is not always
   possible to use large reference datasets. In our study, the number of
   specimens considered in the reference datasets was conditioned by the
   Project objectives, which were focused on native trout conservation in
   Molise’s rivers. Furthermore, to the best of our knowledge, except for
   the study by Palombo et al. [[135]6], there are no other available data
   using the trout Axiom SNP array for the characterization of
   Mediterranean trout populations. The number of specimens considered in
   the final reference dataset of our study was 68 (i.e., 20 pure native
   Biferno trout, 29 pure native Volturno trout and 19 pure alien trout).
   Due to the large genetic distance between the Mediterranean and Atlantic
   trout lineages, we achieved reliable feature selection using such
   sample sizes for each reference population, which is in line with what
   was reported by Somenzi et al. [[136]30].

   A total of 633 SNPs was retained after filtering steps. PCA plots
   obtained using quality filtered SNP datasets showed a clear separation
   of native and alien trout populations in both rivers ([137]Figure 1).
   Four statistical methods were used for the identification of
   informative SNP panels (i.e., Delta, F[ST], PCA and RF statistics),
   according to Schiavo et al. [[138]26]. Several approaches have been
   proposed in literature for the identification of population-informative
   markers [[139]40] and it is known that the choice of a specific
   approach can affect the results for a particular population [[140]49].
   As explained by Bertolini et al. [[141]50], the main problems in the
   identification of fully informative SNP markers are due to the high
   level of linkage disequilibrium (LD) that is present in most livestock
   populations. In this regard, it is worth noting that a
   supervised machine-learning-based classification approach has been
   shown to partially reduce this problem [[142]25].
   Furthermore, our study involved a wild population, where LD can be
   expected to be much less extensive than in livestock species. The
   stability of RF selection was assessed by implementing a method based
   on iterations and evaluating the frequencies with which SNPs were
   selected and the mean values of the ranking parameters, as already
   proposed by other authors [[143]26]. This led to two different
   candidate SNP panels for each RF criterion applied. Overall, seven
   different reduced panels, each including the top ranked 96 AIMs, were
   selected by the four approaches applied ([144]Table 1).
   Four panels were derived using RF by applying the GI and EN ranking
   methods (two stability methods were tested for each RF approach). RF
   methods shared a notably high number of top ranked SNPs (an average of
   81 out of 96 SNPs among all applied methods). However, the highest
   number of shared SNPs was detected between F[ST] and RF GI 2 methods (87
   SNPs, [145]Table 1). Conversely, the lowest number of markers (38 SNPs)
   was detected between PCA and RF GI 1 and/or RF EN 1 ([146]Table 1). More
   generally, our results suggested that the PCA approach identified a
   different pattern of top ranked SNPs compared to the other methods. This
   might reflect the fact that, being an unsupervised technique, PCA simply
   exploited the observed variability, as already suggested by Schiavo et
   al. [[147]26]. PCA plots obtained with the seven reduced panels
   ([148]Figure S2) suggested that the identified markers could accurately
   discriminate native Mediterranean trout ancestry from alien trout. This
   is in line with what was reported by previous studies in which fewer
   than 100 SNPs gave reliable results in individual assignment
   [[149]25,[150]26,[151]30,[152]47]. This consideration was also
   supported by the outcomes of RF analyses applied with the purpose of
   learning a classification rule to assign specimens to the correct
   populations through the seven identified panels ([153]Table 2). This is
   one of the advantages of this machine learning methodology that can be
   applied for both selection and evaluation purposes. Based on these
   statistics, all seven 96-SNP panels performed quite well. The
   proportion of correct predictions on the training data (train accuracy)
   was 100% for all analysed populations in the reference dataset and for
   all SNP panels ([154]Table 2). In the test dataset (which included only
   20% of the animals of the entire investigated population) a few animals
   were wrongly classified, but the proportion of correct predictions
   (test accuracy) was still high (i.e., ≥92%). In particular, the highest
   value was observed for the SNP panels derived using the RF GI 2 and RF
   EN 2 methods (97%), whereas the lowest was observed for the Delta and
   F[ST] methods (92%). Performance outcomes
   appeared in line with the generally high SNP overlap among all tested
   approaches, excluding the Delta and F[ST] panels, and this supported
   the idea that the most informative markers were effectively selected in
   our study. This consideration was also supported by the low r^2 values
   estimated between ancestry proportions obtained using the full set of
   SNPs and 1000 random reduced panels ([155]Figure S3), whereas high r^2
   values were detected for both rivers across all candidate SNP panels
   (r^2 ≥ 0.973; [156]Table 3).
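
   The two comparisons used above, marker overlap between panels and r^2
   between ancestry proportions, can be illustrated with a minimal Python
   sketch. It is not the code used in this study: the function names, SNP
   identifiers and ancestry values below are hypothetical, and r^2 is
   assumed here to be the squared Pearson correlation between the two
   sets of ancestry estimates.

```python
from itertools import combinations


def pairwise_shared(panels):
    """Count the SNPs shared by each pair of panels.

    `panels` maps a panel name to a set of SNP identifiers.
    """
    return {
        (a, b): len(panels[a] & panels[b])
        for a, b in combinations(sorted(panels), 2)
    }


def common_core(panels):
    """Return the SNPs that are present in every panel."""
    return set.intersection(*panels.values())


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical toy panels (the real panels contain 96 SNPs each).
panels = {
    "Delta": {"snp1", "snp2", "snp3", "snp4"},
    "FST": {"snp1", "snp2", "snp3", "snp5"},
    "PCA": {"snp1", "snp2", "snp6", "snp7"},
}
shared = pairwise_shared(panels)
core = common_core(panels)

# Hypothetical ancestry proportions for five specimens, estimated with
# the full SNP set and with a reduced panel.
full_q = [0.02, 0.10, 0.45, 0.80, 0.98]
panel_q = [0.00, 0.14, 0.40, 0.85, 1.00]
r2 = r_squared(full_q, panel_q)
```

   In this toy example the Delta and F[ST] panels share three markers,
   and the two markers retained by every panel form the common core, in
   the same way as the AIMs shared by all candidate panels.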

   In total, 47 SNPs were shared among all seven identified panels and
   were therefore declared the main candidate AIMs. Our results showed
   that these AIMs can also accurately discriminate native Mediterranean
   trout ancestry from alien ancestry. In particular, we assessed the
   performance of this SNP panel in identifying crosses between native and
   alien trout using both simulated and real data ([157]Figure 2 and
   [158]Figure 3; [159]Table S4). As expected, the AIMs performed better
   on simulated data than on real admixed trout samples, since simulated
   individuals were generated from the same reference populations used to
   select the best AIMs. The mating system applied in simulations
   generated simplified admixture patterns with respect to those occurring
   in real populations. Indeed, real admixed populations presented a more
   complex genetic make-up, influenced by introgression events.
   Performance outcomes of the 47 AIMs on the reference dataset were in
   line with expectations (train accuracy 100%, test accuracy 92%;
   [160]Table 2). In addition, r^2 was quite high (0.955 and 0.979 for the
   Biferno and Volturno rivers, respectively; [161]Table 3). The lower r^2
   detected in the Biferno is consistent with the findings of Palombo et
   al. [[162]6], who described a more introgressed scenario in that river.
   Furthermore, the heterogeneous distribution of ancestry scores obtained
   with the 47 AIMs within each introgression class, as estimated through
   the combination of mtDNA and LDH-C1* genotypes [[163]33] ([164]Figure
   S1), suggested that the selected AIMs panel could be an effective tool
   to support conservation and monitoring activities within the LIFE17
   NAT/IT/000547 Project. Moreover, considering the growing interest in
   restoring the genetic integrity status of Mediterranean trout
   populations in Italy, the development of a customized multiplex PCR
   panel for simple amplicon sequencing may help to confirm the outcomes
   of our approach outside the Molise rivers.
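
   To illustrate why simulated crosses yield simpler admixture patterns
   than real populations, the following Python sketch generates F1 and
   backcross genotypes by Mendelian sampling. It is a generic
   illustration and not the simulation procedure used in this study: the
   function names are hypothetical, loci are assumed to be unlinked, and
   the toy markers are assumed to be fixed for alternative alleles in the
   two parental groups.

```python
import random


def sample_genotype(freqs, rng):
    """Draw a diploid genotype from per-SNP allele frequencies.

    Each genotype value is the number of copies (0, 1 or 2) of the
    allele whose frequency is given in `freqs`.
    """
    return [int(rng.random() < p) + int(rng.random() < p) for p in freqs]


def gamete(genotype, rng):
    """Produce a haploid gamete (0 or 1 copies per SNP), unlinked loci."""
    out = []
    for g in genotype:
        if g == 1:
            # Heterozygote: either allele is transmitted with equal chance.
            out.append(rng.randint(0, 1))
        else:
            # Homozygote: the transmitted allele is certain.
            out.append(g // 2)
    return out


def cross(parent_a, parent_b, rng):
    """Offspring genotype obtained from one gamete of each parent."""
    ga = gamete(parent_a, rng)
    gb = gamete(parent_b, rng)
    return [a + b for a, b in zip(ga, gb)]


def native_fraction(genotype):
    """Fraction of counted alleles, a crude ancestry score for fixed markers."""
    return sum(genotype) / (2 * len(genotype))


rng = random.Random(42)

# Hypothetical diagnostic markers: the counted allele is fixed in the
# native group (frequency 1.0) and absent in the alien group (0.0).
native = sample_genotype([1.0] * 10, rng)
alien = sample_genotype([0.0] * 10, rng)

f1 = cross(native, alien, rng)      # first-generation hybrid
backcross = cross(f1, native, rng)  # F1 crossed back to a native parent
```

   With fully diagnostic markers every F1 is heterozygous at all loci, so
   its score is exactly 0.5, and a backcross to the native group falls
   between 0.5 and 1. Real admixed specimens derive from many generations
   of crossing and from markers that are not fixed, which is consistent
   with the lower performance observed on real data.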

   Notably, the distribution of the 47 common AIMs along the genome
   appeared to be heterogeneous: the highest numbers of AIMs were located
   on chromosomes 12, 19, and 26 ([165]Figure 4). In this regard, it is
   interesting to note that this distribution does not reflect chromosome
   size, suggesting the possible presence of a selection signature, even
   though no genes of interest were identified and no pathways were
   enriched.
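
   One way to make the comparison with chromosome size explicit is to
   contrast the observed AIM counts with the counts expected if markers
   were distributed in proportion to chromosome length. The Python sketch
   below computes these expected counts and a Pearson chi-square
   statistic. No such formal test is reported in this study: the
   function names, chromosome labels, lengths and counts are hypothetical
   and serve only to show the calculation.

```python
def expected_counts(total, lengths):
    """Expected marker count per chromosome, proportional to its length."""
    genome = sum(lengths.values())
    return {chrom: total * length / genome for chrom, length in lengths.items()}


def chi_square(observed, lengths):
    """Pearson chi-square statistic of observed versus expected counts.

    Chromosomes missing from `observed` are treated as having zero markers.
    """
    total = sum(observed.get(chrom, 0) for chrom in lengths)
    expected = expected_counts(total, lengths)
    return sum(
        (observed.get(chrom, 0) - exp) ** 2 / exp
        for chrom, exp in expected.items()
    )


# Hypothetical chromosome lengths (Mb) and marker counts.
lengths = {"chrA": 50.0, "chrB": 30.0, "chrC": 20.0}
proportional = {"chrA": 5, "chrB": 3, "chrC": 2}
clustered = {"chrA": 2, "chrB": 2, "chrC": 6}

stat_proportional = chi_square(proportional, lengths)
stat_clustered = chi_square(clustered, lengths)
```

   A statistic close to zero indicates counts that follow chromosome
   length, whereas larger values indicate clustering. With only 47
   markers spread over many chromosomes the expected counts are small, so
   a permutation-based assessment would likely be more appropriate than
   the chi-square approximation.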

5. Conclusions

   The use of molecular tools to support brown trout conservation
   programs and management is of paramount importance; however,
   conventional molecular markers are often insufficient to classify
   specimens at the individual level and/or are expensive and time
   consuming.

   In this work, SNP-array technology and an ML approach were combined for
   the first time to select the most informative markers for Atlantic and
   Mediterranean trout identification. A reduced panel of 47 AIMs was
   identified. The high correlations between ancestry coefficients
   calculated using the full set of SNPs and the reduced panel supported
   the idea that this panel encompasses AIMs with high discriminant
   capacity.

   Further studies with larger sample sizes and/or new populations are
   required to corroborate our outcomes outside the basins of the Molise
   rivers and to develop a customized multiplex PCR panel for large-scale
   genotyping based on simple amplicon sequencing of Mediterranean trout
   populations in Italy. Nevertheless, the methodology described in this
   study could be useful for AIMs identification in other species.

Acknowledgments