Abstract

   Gir is one of the main cattle breeds raised in tropical South American
   countries. Strong artificial selection through its domestication
   resulted in increased genetic differentiation among the countries in
   recent years. Over the years, genomic studies in Gir have become more
   common. However, studies of population structure and signatures of
   selection in divergent Gir populations are scarce and need more
   attention to better understand genetic differentiation, gene flow, and
   genetic distance. Genotypes of 173 animals selected for growth traits
   and 273 animals selected for milk production were used in this study.
   Clear genetic differentiation between beef and dairy populations was
   observed. Different criteria led to genetic divergence and genetic
   differences in allele frequencies between the two populations. Gene
   segregation in each population was forced by artificial selection,
   promoting isolation, and increasing genetic variation between them.
   Results showed evidence of selective forces in different regions of the
   genome. A total of 282 genes were detected under selection in the test
   population based on the fixation index (Fst), integrated haplotype
   score (iHS), and cross-population extend haplotype homozygosity
   (XP-EHH) approaches. The QTL mapping identified 35 genes associated
   with reproduction, milk composition, growth, meat and carcass, health,
   or body conformation traits. The investigation of genes and pathways
   showed that quantitative traits associated to fertility, milk
   production, beef quality, and growth were involved in the process of
   differentiation of these populations. These results would support
   further investigations of population structure and differentiation in
   the Gir breed.

Introduction

   Gir is one of the main Bos indicus cattle breeds native from India that
   was first introduced in Brazil in late 1800s. Although Gir is an
   important breed for milk production in tropical South American
   countries, small populations are still found in North America and
   Australia. Few animals (<700) were imported in Brazil between 1870 and
   1962. In 1938, the Brazilian Association of Zebu Cattle Breeders (ABCZ)
   started the registration of Gir cattle [[42]1] and contributed to the
   dissemination of this breed in Brazil. In the 1960s, a number of
   Brazilian breeders started selecting for dual-purpose (milk and meat),
   and others selected only for milk production. Since 1993, the majority
   of the Gir breeders have been selecting mainly for milk yield [[43]1].

   Because of the importance for livestock and conservational systems,
   along with the reduced cost of whole-genome genotyping, genomic studies
   in Gir have become more common in the latest years. Most of the recent
   studies in Gir cattle are related to genetic structure, genetic
   diversity and inbreeding levels based on pedigree data in dairy herds
   [[44]1], whole genome sequencing [[45]2,[46]3] for studying regions
   under selection for environmental adaptation [[47]4], and genetic
   differentiation compared with other breeds [[48]5]. Another study in a
   Gir population identified low levels of taurine introgression during
   the formation of this breed [[49]6]. These studies were relevant for
   understanding the Gir history and exploiting the genetic background of
   the population. In 2013, three selective sweeps were identified and
   overlapped with the ST6GALNAC5 gene in Gir [[50]7]. Recently, Peripolli
   et al. (2018) [[51]8] reported 14 runs of homozygosity (ROH) islands in
   a Brazilian Gir dairy population, including the animals used by
   Utsunomiya et al. (2013) [[52]7]. ROH islands are regions of the genome
   with high levels of homozygosity and have been used to assess regions
   under selective pressure.

   Strong artificial selection through domestication resulted in increased
   diversity among recent cattle populations. Diversity includes variation
   in morphology, physiology, production, and fertility [[53]9]. Assessing
   patterns of genetic variation is of particular interest for studying
   domestication, breed formation, population structure, and consequences
   of selection [[54]10]. Several measures of genetic differentiation have
   been proposed, and one of them is the fixation index (Fst; Wright
   (1951) [[55]11]). Several studies have used Fst as a tool for
   identifying patterns of genetic variation at a locus among populations
   relative to that within populations. Thus, Fst has also been used as a
   test for identifying signatures of selection in populations [[56]11,
   [57]12] by using high-throughput SNP information. In fact, Fst is one
   of the most commonly used metrics for detecting signatures of selection
   in animals.

   Besides Fst, haplotype-based methods have also been used in signatures
   of selection studies. The integrated haplotype score (iHS) proposed by
   Voight et al. (2006) [[58]13] and the cross-population extend haplotype
   homozygosity (XP-EHH) test proposed by Sabeti et al. (2007) [[59]14]
   can help to spot evidence of selection with high detection power.
   Recombination has the power to break down linkage disequilibrium around
   a mutation, decreasing the length of haplotypes on which a mutation is
   located [[60]15]. Sabeti et al. (2002) [[61]16] introduced the extended
   haplotype homozygosity (EHH) to exploit the decay of haplotype
   homozygosity as a function of genetic distance from a focal SNP. Later,
   iHS was proposed as a complementary approach to the EHH method and is
   based on the standardized log ratio of the integrals of the observed
   decay of EHH computed for the ancestral and the derived alleles at the
   focal SNP [[62]13]. The iHS test has been currently applied for
   detecting selection signatures in humans [[63]15], plants [[64]17], and
   domestic animals [[65]4, [66]18, [67]19]. The XP-EHH is also based on
   the EHH method and is used to detect selection footprints between
   populations by comparing their EHH profiles.

   The Fst and the haplotype-based approaches are somehow different. The
   time scale over which selection has been occurred has a major impact on
   the ability of each method to detect evidence of selection. The Fst
   method is best suited for detection in events occurring in the more
   distant past [[68]20] whereas the iHS test is best suited for detection
   in recent positive selection [[69]15]. The XP-EHH is useful for
   detection of entirely or approximately fixed loci [[70]14]. As the Fst,
   iHS, and XP-EHH approaches are complementary, incorporating such
   measures in the study of genetic differentiation and selection would
   strongly contribute to understanding gene flow and genetic makeup of
   dual-purpose (meat and milk production) Gir cattle populations.

   So far, a limited number of studies have been conducted to identify
   population structure and signatures of selection in Gir. O’Brien et al.
   (2015) [[71]6] sampled bulls that were widely used for artificial
   insemination in dairy breeding systems, reflecting only the top animals
   in the breed. A study based on the iHS test did not detect significant
   numbers of signatures of selection in the breed probably because very
   few individuals were used (N = 53). Only the ST6GALNAC5 gene was found
   harboring evidence of selection in Gir [[72]7]. A sample of animals
   selected for different purposes from production herds across the
   country would be useful to understand genetic differentiation and
   selection footprints in the Brazilian Gir populations.

   In this study, we present a comprehensive analysis of population
   structure and signatures of selection in two populations of Gir cattle
   selected either for beef or milk production, using high-throughput
   genomic information. Three methods (Fst, iHS, and XP-EHH) were
   implemented to scan the whole genome of those two populations.
   Afterwards, we performed a functional study of the genes identified
   within the regions harboring signatures of selection in each population
   to explain the biological importance of selection footprints. The
   information from our research can be useful for future GWAS studies,
   conservation, and genetic improvement of the Gir breed.

Material and methods

Data resource

   The experiment was conducted in accordance with animal welfare
   guidelines according to State Law No. 11.977 of the São Paulo state,
   Brazil. All animal procedures were approved by the Ethics and Animal
   Handling Committee of the Instituto de Zootecnia, Nova Odessa, SP,
   Brazil.

   Genotyped animals from two distinct populations were used in this
   study. A set of 273 female Gir were obtained from the Brazilian Program
   of Dairy Gir Genetic Improvement (PNMGL), which is a breeding program.
   These animals belonged to five farms located in Minas Gerais and Sao
   Paulo states. More details of the history and genetic background of
   this population can be found in Santana Jr et al. (2014) [[73]1]. A set
   of 173 males and females were obtained from the Animal Science
   Institute (IZ; Sertaozinho, SP), which started a research program for
   growth traits in indicine breeds in 1976, including Gir.

Differentiation between dairy and beef populations

   Among all the Gir breeding or research programs in Brazil, two are of
   special interest: a) PNMGL aims improving milk production traits; b)
   IZ, which is considered unique, aims improving beef production traits
   in a closed herd scheme. The PNMGL was created in 1985 to conduct
   progeny testing programs [[74]1], where the main objective was to
   identify genetically superior bulls for milk production, based on
   progeny performance. Traits such as milk production, milk composition
   (fat and protein), somatic cell counts, handling (e.g., ease of milking
   and temperament), and body conformation (e.g., ligament last udder,
   rear udder height, rear udder width, and length and diameter of the
   teats) are included in the breeding program. The mean 305-day milk
   yield of the current dairy Gir population is 3000 ± 1500 kg [[75]1].
   Because of the well-designed breeding program, semen of genetically
   proven bulls from PNMGL are commercialized worldwide. Recently, genomic
   information started being used in genetic evaluation of the Brazilian
   Gir cattle.

   The IZ started the first breeding season in 1976 with different lines
   that were introduced in the Gir herd to increase genetic variability
   and avoid problems with inbreeding. Since the first breeding season,
   only animals from this herd were used for mating, making it a closed
   herd. The choice for founders was based on their yearling body weight
   (YBW) at 550 days. The first genetic evaluation was performed in 1981
   [[76]21] and the selection criterion for sires was the final body
   weight at 378 days of age, after 168-day feedlot performance tests.
   Females remained on pasture, and the selection criterion for females
   was their YBW at 550 days of age. For this herd, the generation
   interval was 5.65 years, and genetic gains were 2.88 kg/year for the
   weight of sires after the weight gain test and 2.80 kg/year for female
   considering the adjusted weight at 550 days [[77]22]. Records of milk
   production were not available in this herd, animals were selected for
   growth traits.

SNP array data and quality control

   Animals used in this study were genotyped with the GGP Bovine LDv4
   (Illumina, San Diego, CA) with approximately 33K SNP. Only SNPs located
   in the 29 autosomes remained for the analysis. Quality control was
   performed to ensure high quality of genomic data. Therefore, SNPs were
   removed when they were monomorphic, had a call rate lower than 0.90,
   and had a minor allele frequency (MAF) lower than 0.01, and when the
   difference between expected (Hardy-Weinberg equilibrium) and observed
   allele frequencies at a given locus was greater than 0.15. Samples with
   a call rate lower than 0.90 were also removed. After the quality
   control, a total of 442 animals and 23,275 SNPs were retained for
   further analyses. Filtering the SNPs based on MAF may affect the
   probability of identifying alleles related to selection [[78]23]. For
   this reason, a low threshold for MAF (<0.01) was imposed as a criterion
   for the SNP quality control.

   Imputation of missing SNPs and phasing were performed via Beagle 3.3.2
   [[79]24]. The phased data were annotated with ancestral reference
   alleles, and then the haplotype file and the physical map were used for
   the iHS analysis.

Genetic structure of the population

   To access the Gir population structure and to understand the
   relationship within and between populations in a genomic level,
   Principal Component Analysis (PCA) and Discriminant Analysis of
   Principal Components (DAPC) were performed using the adegenet package
   for the R software [[80]25]. The PCA approach allows classifying
   individuals based on the reduced number of important orthogonal
   principal components (PC) [[81]26]. Each PC relates its eigenvalue that
   describes the amount of total inertia explained on the component. When
   used in the context of genomic analysis of population structure, the
   eigenvalues indicate part of the total genetic variability represented
   by the associated PC. The first PC related to a high amount of inertia
   reproduces the structuring of the genetic data [[82]27]. To perform PCA
   with different subsets of SNP markers, the default options for the
   glPCA function were used, allowing compensating for differences in
   variance among allele frequencies [[83]27].

   The DAPC was applied to select the optimal number of clusters (K),
   among all output clusters from PCA. The choice of K was made on the
   basis of the lowest associated Bayesian Information Criterion (BIC;
   Jombart et al. (2010) [[84]25]). When the optimal number of clusters is
   ambiguous, K increases as long as it resulted in a noticeable
   improvement in BIC. DAPC uses K-means clustering of PC in order to
   infer the actual number of populations [[85]25]. The DAPC was run using
   100 PC and 10 discriminant functions.

Divergent selection between populations

   To verify the genetic divergence between two populations, genetic
   statistics such as observed heterozygosity (Ho), genetic diversity
   within population (Hs), total genetic diversity (Ht), coefficient of
   inbreeding (Fis), and fixation index (Fst) were computed using the
   HierFstat R package [[86]28]. The fixation index (Fst), as defined in
   Nei (1987) [[87]29], was used as a measure of genetic differentiation
   between populations being derived from the equation:
   [MATH: <mi>F</mi><mi>s</mi><mi>t</mi><mspace
   width="2pt"></mspace><mo>=</mo><mspace
   width="2pt"></mspace><mfrac><mrow><mi>H</mi><mi>t</mi><mo>-</mo><mi>H</
   mi><mi>s</mi></mrow><mrow><mi>H</mi><mi>t</mi></mrow></mfrac> :MATH]

   Fst quantifies differences in allele frequencies between populations,
   and theoretically its value ranges from zero to one, meaning that there
   is no differentiation or complete differentiation in which
   subpopulations are fixed for different alleles, respectively. Fst
   values are effective for identifying selection signatures between
   different groups, i.e., loci in which alleles are fixed differently in
   different groups [[88]30] and allow determining how the divergent
   selection can affect the genomic pattern of these groups [[89]31].

   As a way to compare Fst values for each SNP, measures of centrality and
   dispersion such as mean and standard deviation were considered. The
   negative Fst values observed for 2,794 SNPs were set to zero, since
   negative values have no biological interpretation [[90]32]. Loci were
   plotted relative to their physical position within each autosome. The
   threshold to call an SNP outlier was defined as three standard
   deviations above the mean. This methodology allows identifying SNPs
   with Fst values that stand out from the others, and that could be
   related to genes affecting adaptive and/or economical important traits
   for beef or milk production, meaning evidence of selection signatures.
   Similar approaches have been used in other studies for identifying
   selection signatures [[91]12, [92]31]. A control chart approach allows
   the partition of Fst variation into a component due to selection and
   also allows the discovery of significant SNPs based on control limits
   set at three standard deviations from the mean [[93]12]. Kijas et al.
   (2012) [[94]10] and Zhao et al. (2015) [[95]31] used the top 0.1% Fst
   values to represent selection signature.

   The approach adopted in the present study is a simplistic technique
   that allows to access the variation patterns on many loci and considers
   values above the cut-off as evidence of divergent selection. Assuming
   that the data follow a normal distribution, a Fst value above three
   standard deviations from the mean indicate that the locus has a value
   higher than 99.8% of total SNPs in the chromosome.

Within population test

   The iHS test was used to detect strong footprints of selection within
   the studied population. This test is based on the standardized log
   ratio of integrals of the observed decay of extended haplotype
   homozygosity (EHH), computed for both ancestral and derived alleles at
   the focal SNP.

   Phased haplotypes produced by Beagle 3.3.2 and ancestral allele
   information were submitted to iHS-based tests. For each SNP, ancestral
   and derived alleles were determined according to the 50K SNP annotation
   file provided in Rocha et al. (2014) [[96]33], which included
   information of putative ancestral alleles of bovine SNPs. The number of
   SNPs in common between the GGP LDv4 SNP chip and the 50K SNP was
   26,058. The phased haplotypes were annotated with ancestral allele
   information (via R package rehh 2.0 [[97]15]). For each allele in dairy
   and beef populations, information was defined as missing if the code
   was incompatible with ancestral and derived alleles. The iHS score was
   computed for each autosomal SNP using the R package rehh 2.0 [[98]17].
   Default options were generally used, except for the minimal threshold
   on the minor allele frequency that was set to 0.01 and the percentage
   of retained haplotypes that were changed to suit the Gir haplotype
   data. In addition, the standardization of iHS was performed with allele
   frequency bins of 0.01, which is controlled by the ihh2ihs function in
   the package [[99]15].

   According to Voight et al. (2006) [[100]13], the iHS of a given focal
   SNP (iHS) was defined as
   [MATH: <mi>i</mi><mi>H</mi><mi>S</mi><mspace
   width="2pt"></mspace><mo>=</mo><mspace
   width="2pt"></mspace><mfrac><mrow><mi>U</mi><mi>n</mi><mi>i</mi><mi>H</
   mi><mi>S</mi><mo>-</mo><mspace
   width="2pt"></mspace><msubsup><mrow><mi>μ</mi></mrow><mrow><mi>U</mi><m
   i>n</mi><mi>i</mi><mi>H</mi><mi>S</mi></mrow><mrow><msub><mrow><mi>p</m
   i></mrow><mrow><mi>s</mi></mrow></msub></mrow></msubsup></mrow><mrow><m
   subsup><mrow><mi>σ</mi></mrow><mrow><mi>U</mi><mi>n</mi><mi>i</mi><mi>H
   </mi><mi>S</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mrow><mi>s</m
   i></mrow></msub></mrow></msubsup></mrow></mfrac> :MATH]

   where UniHS = log(iHH[ancestral]/iHH[derived]) (where iHH is the
   integrated allele-specific extended haplotype homozygosity for core SNP
   alleles (ancestral and derived)), and
   [MATH:
   <msubsup><mrow><mi>μ</mi></mrow><mrow><mi>U</mi><mi>n</mi><mi>i</mi><mi
   >H</mi><mi>S</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mrow><mi>s<
   /mi></mrow></msub></mrow></msubsup> :MATH]
   and
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>U</mi><mi>n</mi><mi>i</mi><mi
   >H</mi><mi>S</mi></mrow><mrow><msub><mrow><mi>p</mi></mrow><mrow><mi>s<
   /mi></mrow></msub></mrow></msubsup> :MATH]
   are the average and the standard deviation of the UniHS, respectively,
   computed over all the SNPs with a derived allele frequency p[s] similar
   to that of the core SNPs. The iHS was constructed to approximately
   follow a standard Gaussian distribution and to enable comparisons among
   SNPs regardless of their underlying allele frequencies. The iHS is
   transformed into piHS, as shown by [[101]34]:
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>i</mi></mrow></msub><mi>H</mi><m
   i>S</mi><mspace width="2pt"></mspace><mo>=</mo><mspace
   width="2pt"></mspace><mo>-</mo><msub><mrow><mi>l</mi><mi>o</mi><mi>g</m
   i></mrow><mrow><mn>10</mn></mrow></msub><mo>(</mo><mrow><mn>1</mn><mo>-
   </mo><mn>2</mn><mo>|</mo><mrow><mi>Φ</mi><mo>(</mo><mrow><mi>i</mi><mi>
   H</mi><mi>S</mi></mrow><mo>)</mo><mo>-</mo><mn>0.5</mn></mrow><mo>|</mo
   ></mrow><mo>)</mo> :MATH]

   where Φ(x) represents the Gaussian cumulative distribution function.
   Assuming that most of the iHS values follow the Gaussian distribution
   under neutrally, piHS may be interpreted as a two-sided p-value in a
   -log10 scale [[102]15].

   To control for false positives, the fndr.cutoff function available in
   the fdrtool R package [[103]35] was used with its default options for
   calculating a p-value, which defines the cut-off point chosen according
   to the false non-discovery rate. After the false discovery rate (FDR)
   adjustment within population, the genome-wide significance level was
   equal to approximately 0.006 and 0.008 for dairy and beef cattle
   populations, respectively. This methodology has been used to identify
   regions displaying strong footprints of selection in humans [[104]15],
   cattle [[105]31, [106]36] and other animal populations [[107]18].

Cross-population test

   To compare EHH profiles between the two populations, we used the XP-EHH
   statistics. The main idea of XP-EHH is to test if the genome site is
   homozygous in one population but polymorphic in another population
   through the comparison of EHH score of two populations on one core SNP
   [[108]37]; ancestral information is not needed when performing this
   test. In our analysis, the genome of the beef population, as an
   observed population, was compared with the dairy population, as a
   reference population. The XP-EHH score was computed for each autosomal
   SNP using the same R package used to compute iHS, rehh 2.0 [[109]15].

   The XP-EHH of a given focal SNP (XP-EHH) was defined and standardized,
   according to Sabeti et al. (2007) [[110]14] and Gautier et al. (2017)
   [[111]15] as
   [MATH:
   <mi>X</mi><mi>P</mi><mo>-</mo><mi>E</mi><mi>H</mi><mi>H</mi><mspace
   width="2pt"></mspace><mo>=</mo><mspace
   width="2pt"></mspace><mfrac><mrow><mi mathvariant="normal">L</mi><mi
   mathvariant="normal">R</mi><mi mathvariant="normal">i</mi><mi
   mathvariant="normal">E</mi><mi
   mathvariant="normal">S</mi><mo>-</mo><msub><mrow><mi>m</mi><mi>e</mi><m
   i>d</mi></mrow><mrow><mi mathvariant="normal">L</mi><mi
   mathvariant="normal">R</mi><mi mathvariant="normal">i</mi><mi
   mathvariant="normal">E</mi><mi
   mathvariant="normal">S</mi></mrow></msub></mrow><mrow><msub><mrow><mi>σ
   </mi></mrow><mrow><mi mathvariant="normal">L</mi><mi
   mathvariant="normal">R</mi><mi mathvariant="normal">i</mi><mi
   mathvariant="normal">E</mi><mi
   mathvariant="normal">S</mi></mrow></msub></mrow></mfrac> :MATH]

   where LRiES = log(iEH[population1]/iEH[population2]); iEH is the
   integrated allele-specific extended haplotype homozygosity; med[LRiES]
   and σ[LRiES] are the median and the standard deviation of the LRiES,
   respectively. Populations 1 and 2 are considered as a reference
   population and an observed population, respectively. The XP-EHH is
   transformed into p[XP-EHH] as shown by Gautier et al. (2011) [[112]34]:
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>X</mi><mi>P</mi><mo>-</mo><mi>E<
   /mi><mi>H</mi><mi>H</mi></mrow></msub><mspace
   width="2pt"></mspace><mo>=</mo><mspace
   width="2pt"></mspace><mo>-</mo><msub><mrow><mi>l</mi><mi>o</mi><mi>g</m
   i></mrow><mrow><mn>10</mn></mrow></msub><mo>(</mo><mrow><mn>1</mn><mo>-
   </mo><mn>2</mn><mo>|</mo><mrow><mi>Φ</mi><mo>(</mo><mrow><mi>X</mi><mi>
   P</mi><mo>-</mo><mi>E</mi><mi>H</mi><mi>H</mi></mrow><mo>)</mo><mo>-</m
   o><mn>0.5</mn></mrow><mo>|</mo></mrow><mo>)</mo> :MATH]

   where Φ(x) represents the Gaussian cumulative distribution function. As
   p[XP-EHH] may be interpreted as a two-sided p-value in a -log10 scale,
   regions with a p-value lower than 0.01 (1%) were considered signatures
   of selection in the test populations. Negative XP-EHH scores suggest
   selection happened in the reference population, otherwise happened in
   the observed population.

Candidate genes and functional analysis

   A genomic region was considered as being under selection if it
   contained significant SNPs based on Fst or iHS values. Windows of 500
   kb (250 kb upstream and 250 kb downstream of the significant SNP) were
   investigated to verify overlapping gene segments. Additionally, genes
   were compared with QTL regions previously identified and present in the
   Cattle QTL database
   ([113]https://www.animalgenome.org/cgi-bin/QTLdb/BT/search). Gene
   annotation was performed using the UMD3.1 bovine genome assembly from
   the BioMart ([114]www.ensembl.org/biomart) and NCBI
   ([115]https://www.ncbi.nlm.nih.gov) databases. The orthologous genes
   from primates and sheep were used when the annotation information for
   bovine genes was not available.

   A database for Annotation, Visualization, and Integrated Discovery
   (DAVID) v6.8 tool [[116]38, [117]39] was used to identify significant
   (p < 0.05) Gene Ontology (GO) terms and KEGG (Kyoto Encyclopedia of
   Genes and Genomes) pathways using a list of genes with significant SNPs
   based on Fst and iHS values and the Bos taurus annotation file as a
   reference genome.

Results and discussion

   Eigenvalues of the PCA are shown in [118]Fig 1a. The genetic structure
   of the dataset, i.e. the variance of the data, was captured mainly by
   the first PC. The eigenvalue for the first PC was 134.11, and the
   eigenvalue for the second PC was 39.16 ([119]Fig 1a). Although
   eigenvalues are absolute variances of the corresponding PC, it is
   common to express them as a percentage of the total variation in the
   data. In this case, percentages of variance explained by first and
   second PCs were 6.79 and 1.98, respectively. The percentage of the
   variance explained by PC was calculated as the eigenvalue times 100
   divided by the sum of all the eigenvalues.

Fig 1. Genetic structure of the Brazilian Gir population.

   [120]Fig 1
   [121]Open in a new tab

   (a) Eigenvalues of principal component analysis (PCA). (b) PCA
   scatter-plots of the first two principal components (PC) showing
   clearly separation between beef population (green color) and dairy
   population (red-yellow colors). (c) Inference of the number of clusters
   in Gir cattle based on K-means algorithm. (d) Plots of the first two
   discriminant functions of discriminant analysis of PC algorithm.

   In [122]Fig 1b, it is clear that both groups were completely separated
   according to the first principal component (PC1). The obtained results
   are clear and consistent with the history of both populations, which
   were under different breeding programs; therefore, were subjected to
   the intentional segregation of genes within each population, promoting
   the complete isolation and genetic variation between them. The dairy
   population presented a more scattered cluster based on the second
   component (PC2), indicating higher genetic distance and lower
   relatedness between individuals of the population, compared with the
   beef population cluster. Because animals from the dairy population were
   from commercial herds located in different regions of Brazil, it is
   possible that the PCA analysis clustered individuals of the dairy
   population according to their different demographic histories,
   providing a more scattered pattern. The proximity between animals from
   the beef population could be due to the fact that it was characterized
   as a closed herd in which only breeding animals of this herd were used
   in the mating. Mating of related animals favored the increase of
   inbreeding and reduced the genetic difference among individuals of this
   group.

   Results of PCA were based on the clustering of individuals into
   geographically pre-defined groups. Discriminant Analysis of Principal
   Components based on the K-means algorithm, implemented in the R package
   adegenet [[123]25] was further used to test the greatest number of
   clusters and to better evaluate the hypothesis on the two populations.
   An inspection of BIC values when the number of clusters (K) ranged from
   1 to 100 showed that two clusters should be considered ([124]Fig 1c);
   given this was the number with the lowest BIC. This test reflects the
   minimum number of clusters after which the BIC increases or decreases
   by a negligible amount [[125]25]. The estimation of K by the DAPC has
   been markedly successful for the inference of the exact number of
   clusters in recent population genetic studies [[126]40, [127]41]. The
   plots of the first discriminant function of DAPC with different colors
   for different groups provided a visual assessment of between-population
   genetic structure ([128]Fig 1d). In the present study, DAPC can also be
   considered well suited to describe the genetic diversity of the
   genotyped animals in the Brazilian Gir population.

   According to PCA and DAPC analyses, there was a clear genetic
   difference between beef and dairy populations, and this division
   reflects the variation existent between the populations. Because of
   that, the next step was to compute genetic statistics taking into
   account the stratification of the dataset into beef and dairy
   populations. Average heterozygosity by chromosome was different between
   the two populations ([129]Fig 2). The greatest discrepancy between
   means was observed on chromosome 23 (0.06). The allele
   frequency-dependent diversity estimate, such as observed heterozygosity
   (H[o]), is a measure of genetic variation and can be very useful in
   comparison between populations [[130]10]. The dairy population
   displayed higher heterozygosity levels compared with the beef
   population, which most likely reflects relative levels of genetic
   diversity. The highest heterozygosity level was observed on chromosome
   17 (0.38), and the exception was for chromosome 20, where the beef
   population displayed the higher mean. Fis was slightly negative for
   dairy (-0.008) and beef (-0.0322) populations. Negative values of Fis
   could indicate an excess of heterozygosity beyond that expected under
   Hardy-Weinberg equilibrium within those populations. Negative values of
   Fis in Gir and other cattle breeds have been reported in the literature
   [[131]6, [132]42].

Fig 2. Average heterozygosity per chromosome in two populations.

   [133]Fig 2
   [134]Open in a new tab

   Light grey = beef cattle; dark grey = dairy cattle.

   Genetic differentiation varied throughout the genome ([135]Fig 3). The
   average value for Fst was 0.033. Although the two populations in the
   present study were selected for different purposes, some levels of
   proximity between them were expected and were indicated by the low Fst
   values. In addition, the average Fst value was close to that observed
   in different cattle breeds [[136]9, [137]30].

Fig 3. Genomic distribution of Fst values.

   [138]Fig 3
   [139]Open in a new tab

   SNPs were plotted relative to their physical positions within each
   autosome. The cutoff to call SNP outliers was defined as three standard
   deviations above the mean for each autosome. Red dots are SNPs with Fst
   beyond the cutoff value.

   The level of genetic differentiation between populations is expected to
   be low in neutral regions of the genome or in regions of balanced
   selection, and divergent in regions subject to directional selection.
   When a Fst value is zero, there is no genetic differentiation between
   the populations under comparison. In the present study, low Fst values
   were possibly identified because the two populations belonged to the
   same breed and were originated from the same recent ancestral
   population; therefore, many alleles were expected to be commonly fixed
   in both populations. Gir is originally a dual-purpose breed. During the
   1960s, part of the Brazilian breeders were artificially selecting
   animals for meat and milk simultaneously [[140]1], which could explain
   the expected levels of proximity and low Fst values for the two
   populations. These populations only started being selected for
   different purposes in the last decades. Under artificial selection,
   allele frequencies can shift over time in the direction of the desired
   phenotypes producing signatures of selection.

   The Fst approach enables the detection of selection signatures based on
   differences in allele frequencies across populations. This method
   revealed 488 SNPs as relevant loci for divergent selection among the
   total number of loci evaluated. Although being from the same breed,
   some changes in allele frequencies were observed between the two
   populations ([141]Fig 4). This change in allele frequencies and the
   genetic diversity observed were possibly caused by recent selection for
   different criteria and other non-reported genetic events that occurred
   in the past. These findings are supported by the population structure
   results. The higher allele frequencies of these SNPs are representative
   of differences in selection, neutrality or other processes used in
   breeding programs [[142]40].

Fig 4. Scatter plots for population-specific allele frequency (dairy x beef).

   [143]Fig 4
   [144]Open in a new tab

   The data were displayed as a collection of points; each point
   represents an SNP having the allele frequency for the dairy population
   determining the position on the vertical axis and the allele frequency
   for the beef population determining the position on the horizontal
   axis. It reveals a positive linear relationship between allele
   frequencies of two populations.

   The SNPs identified as outliers based on the Fst method can be strong
   evidence of signatures of selection. Four SNPs had Fst values higher
   than 0.4; they were located on chromosomes 5, 7, 10, and 26. In total,
   19 SNPs had Fst values ranging from 0.3 to 0.4. The greatest number of
   outliers was obtained on chromosome 1 (n = 33, with Fst values ranging
   from 0.15 to 0.32) and the lowest was obtained on chromosome 25 (n = 7,
   with Fst values ranging from 0.20 to 0.38) ([145]S1 Fig).

   Considering the threshold of three standard deviations to report an
   outlier Fst, the SNPs elected as outliers had Fst values higher than
   99.8% of the total SNPs in the chromosome, which identified a variation
   pattern within the chromosome. Cadzow et al. (2014) [[146]20]
   acknowledged that distinct methods use different patterns of genetic
   variation to identify evidence of selection. In addition, they
   highlighted the importance of utilizing multiple methodologies for
   investigating selection among populations. Thus, the iHS test was used
   as a complementary approach to identify regions that exhibit evidence
   of selection. This is a linkage disequilibrium-based method that
   provides increased power for assessment of signatures of selection
   within the population using high-throughput molecular information
   (e.g., SNP arrays).

   After adjustment for FDR within the population, SNPs displaying piHS
   values greater than or equal to 2.10 (which approximately corresponded
   to p-value <0.008) were considered significant for the beef population.
   For the dairy population, the cutoff value was 2.20 (which
   approximately corresponded to p-value <0.006). The "fdrtool" package
   allows the computation of local FDR values from p-values while taking
   into account an empirical null model [[147]35]. The application of the
   FDR adjustment within each population avoids the detection of false
   positive selection signatures [[148]31]. The chromosome-wide scans of
   iHS for beef and dairy populations are shown in [149]Fig 5. The plots
   show clear evidence of selective forces in different regions of the
   genome. A total of 82 and 129 SNPs harbored signatures of selection in
   beef and dairy populations, respectively. The most significant SNP
   mapped to chromosome 6 (95,171,308 bp) for the beef population (iHS =
   -4.22) and to chromosome 16 (25,933,132 bp) for the dairy population
   (iHS = -4.43).

Fig 5. Genome-wide distribution of |iHS| values for Gir cattle.

   [150]Fig 5
   [151]Open in a new tab

   Upper = beef population; bottom = dairy population.

   To further identify regions displaying strong signatures of selection,
   500 kb windows containing the significant SNPs were investigated. A
   total of 23 and 43 candidate genes overlapped with significant windows
   for beef and dairy populations, respectively. The description of the
   top ten most significant iHS genomic regions per population is shown in
   [152]Table 1. The SLC24A4 (solute carrier family 24 member 4) gene
   located on chromosome 21 was common in the two populations, which can
   be clearly confirmed by the iHS plots ([153]Fig 5). It has been shown
   that the SLC24A4 gene belongs to a group of the potassium-dependent
   sodium/calcium exchanger proteins which has been associated with hair,
   skin and eye pigmentation in humans [[154]43, [155]44]. Nayeri et al.
   (2016) [[156]45] conducted a genome-wide association study (GWAS) and
   identified the SLC24A4 gene in a significant region associated with
   success in fertility traits such as calving interval and days open in
   female Canadian dairy Holstein cattle.

Table 1. Top ten significant iHS genomic regions harboring signatures of
selection in beef and dairy Gir cattle.

   Population Position[157]^a Peak position iHS[158]^b piHS[159]^c
   Candidate gene (position)
   Beef Chr4:35784894–35785394 35785144 -3.42 3.20 SEMA3D
   (35563538–35793762)
   Chr5:4589482–14589982 14589732 3.44 3.23 SLC6A15
   (14572186–14629483)
   Chr6:88541997–88542497 88542247 -3.92 4.05 n.a
   Chr6:88936051–88936551 88936301 -3.61 3.51 n.a.
   Chr6:90771212–90771712 90771462 3.97 4.14 LOC101906736[160]^d
   (90743814–90771489)
   Chr6:93023933–93024433 93024183 3.57 3.45 CCDC158
   (92975081–93058373)
   Chr6:93906353–93906853 93906603 3.79 3.82 n.a.
   Chr6:95171058–95171558 95171308 -4.22 4.62 n.a.
   Chr6:98735539–98736039 98735789 -3.51 3.34 n.a.
   Chr11:75228982–75229482 75229232 -3.47 3.29 ATAD2B
   (75141063–75263986)
   Dairy Chr3:33004500–33005000 33004750 3.86 3.95 n.a.
   Chr13:41168962–41169462 41169212 -3.85 3.93 n.a.
   Chr14:41835997–41836497 41836247 3.68 3.63 n.a.
   Chr15:37760348–37760848 37760598 4.00 4.20 LOC104974236[161]^d
   (37754671–37769288)
   Chr15:37871231–37871731 37871481 4.08 4.34 n.a.
   Chr15:44711168–44711668 44711418 3.72 3.70 STK33
   (44548054–44742390)
   Chr15:52440542–52441042 52440792 -3.69 3.65 n.a.
   Chr16:25932882–25933382 25933132 -4.43 5.03 DUSP10
   (25895748–25936856)
   Chr19:56552331–56552831 56552581 -4.21 4.59 RECQL
   (556549762–56580017)
   Chr21:66999575–67000075 66999825 4.32 4.81 WDR25
   (66917744–67066001)
   [162]Open in a new tab

   n.a., not available.

   ^a All positions are given in base pairs (bp) according to the Bovine
   UMD3.1 assembly.

   ^b iHS is a value of integrated haplotype score.

   ^c piHS is a value of transformed integrated haplotype score.

   ^d non-coding RNA.

   The description of the top ten most significant XP-EHH genomic regions
   per population is shown in [163]Table 2. The most significant SNP was
   mapped to chromosome 22 (58,886,462 bp) for the beef population (XP-EHH
   = 2.44) and to chromosome 21 (57,721,222 bp) for the dairy population
   (XP-EHH = -7.02). We detected 28 SNPs as candidates of selection
   footprints in the beef population when the dairy population was used as
   a reference in the XP-EHH test, but only 10 out of 28 SNPs were mapped
   in gene regions. For the beef population, seven genes (DNAH7, FBLN2,
   HECW2, SLC8A2, TRRAP, UGT1A1, and WNT7A) were identified based on the
   XP-EHH statistics. Two SNPs were located in the region of the UGT1A1
   gene, and two SNPs were located in the region of the WNT7A gene. The
   UGT1A1 gene was located on chromosome 3 and encodes a critical enzyme
   that transforms small lipophilic molecules, such as steroids, hormones,
   and drugs, into excretable metabolites maintaining homeostasis
   [[164]46]. The WNT7A gene regulates several cells and developmental
   pathways that affect the development of female reproductive tract and
   maintain uterine function in adults [[165]47]. The WNT7A gene enhances
   muscle regeneration, increases the satellite stem cells and stimulates
   myogenic stem cell motility [[166]48, [167]49]. Xue et al. (2013)
   [[168]50] identified polymorphisms in the WNT7A gene associated with
   growth traits in Chinese Qinchuan cattle.

Table 2. Top ten significant XP-EHH genomic regions harboring signatures of
selection in beef and dairy Gir cattle.

   Population Position[169]^a Peak position XP-EHH[170]^b pXP-EHH[171]^c
   Candidate gene (position)
   Beef Chr5:62492901–62493401 62493151 2.20 1.56 n.a.
   Chr15:80280407–80280907 80280657 2.24 1.60 n.a.
   Chr18: 54862897–54863397 54863147 2.33 1.71 SLC8A2
   (54853503–54881796)
   Chr18: 54955789–54956289 54956039 2.19 1.55 n.a.
   Chr19:11998622–11999122 11998872 2.22 1.58 n.a.
   Chr19:12009008–12009508 12009258 2.29 1.66 n.a.
   Chr22:58836201–58836701 58836451 2.28 1.65 WNT7A
   (58809372–58873170)
   Chr22:58886212–58886712 58886462 2.44 1.83 n.a.
   Chr22:58896176–58896676 58896426 2.22 1.58 n.a.
   Chr25:38067748–38068248 38067998 2.17 1.53 n.a.
   Dairy Chr21:57531747–57532247 57531997 -6.66 10.56 n.a.
   Chr21:57532846–57533346 57533096 -6.72 10.75 SLC24A4
   (57596461–57783306)
   Chr21:57619270–57619770 57619520 -6.64 10.51 SLC24A4
   (57596461–57783306)
   Chr21:57619838–57620338 57620088 -6.67 10.61 SLC24A4
   (57596461–57783306)
   Chr21:57720972–57721472 57721222 -7.02 11.65 SLC24A4
   (57596461–57783306)
   Chr21:57723114–57723614 57723364 -6.90 11.29 SLC24A4
   (57596461–57783306)
   Chr21:57723895–57724395 57724145 -6.84 11.11 SLC24A4
   (57596461–57783306)
   Chr21:57726974–57727474 57727224 -6.79 10.96 SLC24A4
   (57596461–57783306)
   Chr21:57729925–57730425 57730175 -6.81 11.02 SLC24A4
   (57596461–57783306)
   Chr21:57730971–57731471 57731221 -6.65 10.54 SLC24A4
   (57596461–57783306)
   [172]Open in a new tab

   n.a., not available.

   ^a All positions are given in base pairs (bp) according to the Bovine
   UMD3.1 assembly.

   ^b XP-EHH is a value of cross-population extend haplotype homozygosity
   score.

   ^c pXP-EHH is a value of transformed cross-population extend haplotype
   homozygosity score.

   For the dairy population, 655 negative values presented p-values higher
   than 0.01, but only 197 out of 655 outliers were mapped to gene
   regions. When searching for genes, only 94 were identified for the
   dairy population. The SLC24A4, DUSP10, CTNNA2, ADAMTS3, and TTC12 genes
   were located in regions with the highest number of SNPs identified per
   window, being 29, 12, 9, 7, and 7 SNP per region, respectively. The
   SLC24A4 were also detected with iHS for the beef and dairy population,
   and the DUSP10 gene was also detected with iHS approach. The CTNNA2
   gene was detected with the Fst, iHS and XP-EHH approaches.

   The DUSP10 gene located on chromosome 16 negatively regulates the
   activation of mitogen-activated protein (MAP) kinases and has a
   principal function in both innate and adaptive immune responses as an
   inhibitor of inflammation [[173]51]. This gene has been described
   playing an important role in regulating the balance of energy thought
   the control of brown adipocyte differentiation [[174]52]. Huang et al.
   (2017) [[175]53] comparatively analyzed the transcriptome of
   subcutaneous adipose tissue between Wagyu and Holstein breeds with
   difference in fat deposition and identified the DUSP10 gene
   up-regulated in Wagyu, which could be a key gene associated with fat
   metabolism and adipogenesis. The CTNNA2 (catenin alpha 2) gene codifies
   a protein that plays an important role in the catabolism of collagen,
   cell adhesion and myogenesis [[176]54]. This gene was identified in a
   QTL region associated with Lipomatous Myopathy in Piedmontese beef
   cattle, a disease characterized by degeneration and infiltration of the
   muscular tissue characterized by replacement of myofibers with adipose
   tissue [[177]55].

   The ADAMTS3 gene was mapped on chromosome 6 and encodes a member of the
   procollagen aminopropeptidase subfamily of proteins that play a role in
   the processing of type II fibrillar collagen in articular cartilage
   which is crucial for embryonic development and regulates placental
   angiogenesis [[178]56]. Mészáros et al. (2014) [[179]57] performed a
   genome-wide association study in Fleckvieh bulls and identified the
   ADAMTS3 gene associated with longevity.

   In total, 157 out of 488 significant Fst values were located in 151
   regions. Three SNPs with significant Fst values overlapped with the
   FCHSD2 (FCH and double SH3 domains 2) gene, which is located on
   chromosome 15 and regulates cell growth, migration and adhesion
   [[180]58]. Two Fst peaks were within the genomic regions of NSG1
   (neuronal vesicle trafficking associated 1), RALGPS1 (Ral GEF with PH
   domain and SH3 binding motif 1), and HS3ST5 (heparan
   sulfate-glucosamine 3-sulfotransferase 5) genes, which are located on
   chromosomes 6, 11 and 9, respectively.

   The NSG1 gene encodes a small transmembrane protein highly expressed
   specifically in neurons which plays a critical role in the trafficking
   and polarization of several proteins [[181]59–[182]61]. Lee et al.
   (2016) [[183]62] reported the NSG1 gene as a region containing highly
   selective SNPs, i.e. SNPs that have higher probability of being
   selected in the next generation for milk production traits (milk yield
   and fat and protein contents) in Holsteins.

   The HS3ST5 gene encodes a member of the heparan sulfate
   3-O-sulfotransferases protein family that catalyzes the biosynthesis of
   heparan sulfate, which regulates blood coagulation [[184]63]. This gene
   has been associated with reproductive seasonality in sheep (total days
   of anoestrus and progesterone cycling months) [[185]64].

   Three genes (CTNNA2, SLC24A4, and TMEM117) were detected based on more
   than one approach. In the dairy population, three genes (CNTN3, STK33,
   and TMEM117) were detected with Fst and iHS, four genes (DUSP10, NCAM1,
   SLC24A4, and TMEM117) were detected with iHS and XP-EHH, and thirteen
   genes (AFP, ANKRD17, CSNK1G1, C4orf22, CTNNA2, GALNT14, MAP1A,
   TBC1D22B, TMEM117, TMPRSS11E, UNC79, VPS13C, and ZFAND3) were detected
   with Fst and XP-EHH methods. Eight genes (ADAMTS3, BCL11A, CAPN13,
   CTNNA2, EPHA5, GC, SLC24A4, and SLC25A21) were detected based on the
   iHS for the beef population and XP-EHH for the dairy population.

   The CNTN3 (contactin 3) gene located on chromosome 22 has been
   described as a candidate gene for tenderness in Nelore beef cattle
   [[186]65]. The STK33 (serine/threonine kinase 33) gene located on
   chromosome 15 encodes a member of the calcium/calmodulin-dependent
   kinase family, which exhibits a non-ubiquitous and a low level of
   expression in most tissues [[187]66]. The TMEM117 (transmembrane
   protein 11) located on chromosome 5 codifies an uncharacterized
   multi-pass transmembrane protein. Veerkamp et al. (2012) [[188]67]
   performed a GWAS for feed utilization complex in Holstein–Friesian
   dairy cows and identified an SNP in TMEM117 gene with a large effect in
   body condition scores. In another study also with GWAS, Zhu et al.
   (2017) [[189]68] identified this gene associated with saturated fatty
   acids composition in Simmental cattle. The NCAM1 was mapped on
   chromosome 15 and plays important functional roles in cell migration
   and plasticity changes in the developing and adult nervous system
   [[190]69], insulin signaling and adipocyte differentiation in mouse
   [[191]70].

   A total of 282 genes were detected under selection in the dairy and
   beef Gir populations based on Fst, iHS, and XP-EHH approaches (see
   [192]S1 File for details). Chen et al. (2016) [[193]9] did not find
   common candidate regions in Chinese Holstein and Simmental cattle
   populations by using Fst and XP-EHH methods. The authors attributed it
   to different features of these methods and recommended the integration
   of various methods to increase the detection sensitivity of signatures
   of selection. As Fst, iHS, and XP-EHH tests assume different
   methodologies, more confidence is provided when a common genomic region
   is obtained by various methods. Fst, iHS, and XP-EHH must be used as
   complementary approaches in the detection of signatures of selection.

   We compared the gene list from each test with previous regions of
   selection signatures in Gir. Three significant SNPs harboring
   signatures of selection based on the iHS were identified for Gir dairy
   cattle by Utsunomiya et al. (2013) [[194]7]; only one of the SNPs
   identified by the authors was located in the intergenic region of the
   ST6GALNAC5 gene.

   The ST6GALNAC5 gene was not identified in our analysis. However, six
   genes identified in our study (ARHGEF7, DNAH7, EPB41, RELL1, TRAPPC9
   and ZFHX4) were located within the ROH islands reported by Peripolli et
   al. (2018) [[195]8] in Brazilian Gir dairy cattle. The ROH islands are
   genomic regions with high homozygosity around a selected locus, i.e.,
   with reduced genetic diversity. These islands of the genome might
   harbor targets of positive selection.

   Next, we compared our gene list with QTL regions previously reported in
   cattle and available for online search. In total, 35 genes with
   signatures of selection overlapped with QTL regions. Mapping LCORL and
   NCAPG genes to the cattle QTL database showed that both genes were
   associated with reproduction, growth, and meat and carcass traits. They
   are all located on chromosome 6. A number of QTL terms have been
   reported for cattle in the region of NCAPG and LCORL genes [[196]71,
   [197]72]. The LCORL and NCAPG were identified based on iHS test in the
   dairy population.

   Other genes with selection signatures that had overlapped with QTL
   terms were found being associated with production (KCNIP4, ANXA4, FTO,
   EGFR, and PAK1), reproduction (VAV3, KCNIP4, FRAS1, C4orf22, TRAPPC9,
   and APBB1), milk composition (ROBO1, NRCAM, KCNIP4, DAAM1, TRAPPC9,
   NF2, and FTO), meat and carcass (FAM184B, NCAM1, NF2, CTSD, TOX, and
   VSTM2L), health (TMTC2, IL1RN, BFSP1, TOX, CNTN3, and TRAPPC9), and
   body conformation (GC, ANXA4, and NF2). More details can be found in
   the supplementary [198]S1 File. The above findings suggest that these
   genes might play an important role for traits that have been selected
   in both dairy and beef populations.

   The dataset used for DAVID v.6.8 comprised 142, 23, 43, and 94 genes
   identified based on Fst, iHS beef, iHS dairy, and XP-EHH dairy tests,
   respectively. Regarding to the XP-EHH beef test analysis, only seven
   genes (DNAH7, FBLN2, HECW2, SLC8A2, TRRAP, UGT1A1, and WNT7A) were
   identified in those genomic regions, which is not suitable for a
   functional analysis.

   The significant (P<0.05) GO terms (biological process, cellular
   component, and molecular function) and KEGG pathways identified with
   Fst, iHS dairy, and XP-EHH dairy tests are described in [199]Table 3.
   No significant terms were detected to the iHS beef cattle analysis,
   probably due to the small number of genes on this dataset. Results from
   the functional enrichment analysis with the Fst test revealed a total
   of six KEGG pathways, four GO cellular components, six GO biological
   processes, and one GO molecular function ([200]Table 3).

Table 3. Gene Ontology terms and KEGG pathways enrichment analysis.

   Test Term P Genes
   Fst KEGG Pathway
   bta04530:Tight junction 0.0150 MAGI3, EPB41, ACTN1, AMOTL1, TJP2
   bta04810:Regulation of actin cytoskeleton 0.0160 EGFR, ARHGEF4, VAV3,
   ARHGEF7, DIAPH3, ACTN1
   bta04970:Salivary secretion 0.0200 KCNMA1, PLCB4, ADCY9, PRKG2
   bta04540:Gap junction 0.0240 EGFR, PLCB4, ADCY9, PRKG2
   bta00240:Pyrimidine metabolism 0.0332 NME5, PRIM2, POLR3A, POLR2D
   bta00230:Purine metabolism 0.0340 NME5, ADCY9, PRIM2, POLR3A, POLR2D
   Gene Ontology Cellular Component
   0045211~postsynaptic membrane 0.0072 KCNMA1, GRIK2, CLSTN1, NSG1, HTR3B
   0030027~lamellipodium 0.0103 CORO1C, NF2, ARHGEF7, ACTN1, AMOTL1
   0005925~focal adhesion 0.0105 CORO1C, EGFR, SYNE2, ARHGEF7, LPP, GIT2,
   ACTN1, SYNPO2
   0016328~lateral plasma membrane 0.0287 CORO1C, NSG1, ACTN1
   Gene Ontology Biological Process
   0019228~neuronal action potential 0.0151 KCNMA1, GRIK2, SCN5A
   0035023~regulation of Rho protein signal transduction 0.0224 ARHGEF4,
   VAV3, ARHGEF7, ARHGEF10L
   0030032~lamellipodium assembly 0.0240 ARHGEF4, VAV3, ARHGEF7
   0045184~establishment of protein localization 0.0240 CORO1C, SMYD3, MCC
   0030036~actin cytoskeleton organization 0.0306 CORO1C, NF2, DIAPH3,
   DAAM1
   0042384~cilium assembly 0.0486 CEP295, NME5, TTBK2, AHI1
   Gene Ontology Molecular Function
   0050518~2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase
   activity 0.0147 ISPD
   iHS dairy Gene Ontology Cellular Component
   0097431~mitotic spindle pole 0.0174 NUMA1, EML1
   Gene Ontology Biological Process
   0007420~brain development 0.0201 EML1, RAB18, SLC6A17
   Gene Ontology Molecular Function
   0043140~ATP-dependent 3'-5' DNA helicase activity 0.0114 RECQL5, ASCC3
   0016887~ATPase activity 0.0166 ABCG5, MACF1, CLPB
   0005524~ATP binding 0.0247 STK33, ABCG5, RECQL5, ASCC3, CLPB, PAK1,
   CPS1
   XP-EHH dairy KEGG Pathway
   Gene Ontology Biological Process
   GO:0030326~embryonic limb morphogenesis 0.0107 FRAS1, PBX1, ALX1
   GO:0032964~collagen biosynthetic process 0.0250 TRAM2, ADAMTS3
   GO:0009952~anterior/posterior pattern specification 0.04771 GRSF1,
   PBX1, ALX1
   Gene Ontology Molecular Function
   GO:0005096~GTPase activator activity 0.103 RASAL2, ELMOD3, RALGAPA1,
   TBC1D22B, ARHGAP25
   [201]Open in a new tab

   The regulation of the actin cytoskeleton (bta04810) and tight junction
   (bta04530) are pathways from the cellular process class, related to
   cellular community and cell motility, respectively. The actin
   cytoskeleton is a dynamic skeletal cell structure that contains actin
   and associated proteins, which has been associated with important
   functions in several biological processes including early development
   of oocyte organization and maturation [[202]73], tenderness of
   Longissimus dorsi muscle from Qinchuan cattle [[203]74], Longissimus
   dorsi intramuscular adipose tissue in Hanwoo cattle [[204]75], response
   to diet [[205]76] and calf birth weight in Holstein cattle [[206]77],
   and response to several pathogens such as Trypanosoma congolense
   [[207]78], Salmonella enterica [[208]79], and Mycobacterium bovis
   [[209]80].

   Tight junctions are essential for establishing a selectively permeable
   barrier to diffusion through the paracellular space between cells,
   which are composed of transmembrane proteins that are involved in
   junction assembly, barrier regulation, cell polarity, gene
   transcription, and among other pathways. Tight junctions are known to
   be related to milk mammary gland development and milk secretion,
   controlling the transport of lactose and K+ to the extracellular fluid,
   whereas Cl− and Na+ are transported into milk [[210]81, [211]82]. Since
   decreasing in the tight junction permeability results in increasing
   milk secretion, the tight junction pathway (bta04530) has been reported
   involved in milk production and quality traits in dairy cattle
   [[212]83–[213]85].

   The functional enrichment analysis with the XP-EHH dairy test results
   revealed one GO molecular function and three GO biological processes
   ([214]Table 3). The GTPase activator activity (GO:0005096) is a
   molecular function related to GTPase-activating proteins (GAP), a
   family of regulatory proteins that can bind to activate G-proteins and
   stimulate its intrinsic GTPase activity. Regulation of G-proteins is
   important because these proteins are involved in important cellular
   processes and physiological functions [[215]86]. Regarding to GO
   biological processes identified by the XP-EHH dairy test, all of them
   were involved in the embryonic development and specification of
   patterns of cell differentiation. The embryonic limb morphogenesis
   (GO:0030326) is a biological process that occur in the embryo by which
   the anatomical structures of the limp are generated and organized; the
   anterior/posterior pattern specification (GO:0009952) is defined as a
   regionalization process in which specific areas of cell differentiation
   are determined along the anterior-posterior axis (line that runs from
   the head to the tail of an organism), while the collagen biosynthetic
   process (GO:0032964) is related to chemical reactions and pathways
   resulting in the formation of collagen that is a group of proteins that
   form the main component of connective tissue in animals. Considering
   the intrinsic features of each test, genes identified by the iHS
   approach are potentially involved in recent selection pressures.
   Conversely, genes identified by the Fst method are likely related with
   events occurring further in the past. However, there is no guarantee
   that captured patterns of genotypic variation are a result of selection
   alone. It can be a result of other unrelated ancestral events [[216]20]
   or even it may represent false positive results. To control the false
   positive rate, we applied the FDR adjustment within each population as
   recommended in the literature [[217]31]. For the Fst, we used a
   technique similar to a control chart, but in a simplistic way that was
   based on measures of central tendency and dispersion.

   Even though we used a tool available in the fdrtool R package [[218]35]
   to control for false positives in the iHS test, it is difficult to
   completely rule out potential false positives. Teshima et al. (2006)
   [[219]87] advised that error rates can be further decreased by
   combining several statistics. Even when a stringent cutoff of 1% for
   significance is defined, a fraction of loci identified as targets of
   selection may, in fact, be false discoveries [[220]87]. In this way, it
   is possible that the target of selection that stood out in the Fst,
   iHS, and XP-EHH methods may contain a fraction of false discoveries.
   Genetic historical events different from artificial selection such as
   bottleneck, genetic drift, and migration could shape the population
   genomic variation [[221]88, [222]89]. In addition to the effects of
   positive selection, genetic drift might be the other cause of genetic
   sweep altering the genetic structure of the two cattle populations.
   Further investigations need to be done to distinguish demographic
   history from selection in the two populations used in this study.
   Methods to compare models of positive selection relative to
   nonequilibrium models have been proposed by Jensen et al., 2007
   [[223]88] to accurately define the demographic events in a population.

   The number of animals used in the present study is not a major concern
   in the structure and signatures of selection analysis. Several studies
   have focused on reduced number of animals per population, for example,
   <150 animals [[224]18, [225]90] and <550 animals [[226]7, [227]36].
   This issue can be solved by the choice of appropriate methods among the
   large diversity of existing methods in the literature. As we have seen,
   the approaches used in the present study are largely recommended and
   powerful for comparison and studying genotypic variation in different
   populations.

   The use of non-related animals in population structure studies is
   difficult in livestock species, although the choice for non-related
   animals allows for high genetic differentiation. For a small livestock
   population, the choice for animals to be genotyped are usually made by
   their importance in the herd. It is possible that the genotyped animals
   are related because breeding programs use tools as non-random mating
   systems and reproductive technologies, which can increase the
   inbreeding level in the population. In the case of Gir cattle, the
   level of inbreeding is high and the effective population size is small
   for the dairy [[228]1] and the beef populations. We consider that each
   breeding program was well represented by the chosen animals. Further
   improvements for the present study can be probably achieved by using a
   greater variety of SNP in the genotyping panel.

   We found evidence of signatures of selection for two Gir cattle
   populations artificially selected for different purposes (e.g., meat or
   milk production). The detection of selection signatures used here can
   act as complementary information to current gene mapping approaches
   (GWAS). By comparing candidate gene regions found through the
   identification of selection signatures and GWAS, it is possible to test
   the contribution of genes under selection to phenotypes, which can
   subsequently be used in genomic selection [[229]9]. The findings of the
   present study provided preliminary details about the recent adaptation
   of Brazilian Gir cattle. The loci that we identified as selection
   signatures provided information about genes and pathways in which the
   two Gir populations have adapted to the selective pressures. Strong
   selection on each population led to specialization of these
   populations. As we have seen in the investigation of genes and
   pathways, traces associated with fertility, milk production, beef
   quality, and growth were involved in this process. Furthermore, these
   selective signals indicated the presence of genetic variants that must
   affect complex phenotypes of particular interest for conservation and
   genetic improvement of the Brazilian Gir cattle. As a whole, the
   results found here give basic support for further investigations in the
   Gir breed.

Conclusion

   High-throughput genomic information such as SNP markers can be
   successfully used to study the population structure and to identify
   genomic regions undergone divergent selection in Gir cattle. The
   difference in breeding history was able to imprint a degree of genomic
   differentiation for both beef and dairy populations. The patterns of
   genotypic variation in Gir cattle were consistent with the presence of
   selective pressures at some point in the history of the beef and dairy
   populations. These findings can provide complementary information on
   genomic regions of interest for functional genomic studies, genome-wide
   associations, and the implementation of breeding schemes aiming genetic
   improvement and conservation of livestock populations.

Supporting information

   S1 Fig. Distribution of Fst values for chromosomes 1 and 25.

   (TIF)
   [230]Click here for additional data file.^ (4.9MB, tif)
   S1 File. Intergenic regions harboring signatures of selection in the
   Brazilian Gir cattle.

   (XLSX)
   [231]Click here for additional data file.^ (40.8KB, xlsx)

Data Availability

   All relevant data are within the paper and its Supporting Information
   files.

Funding Statement

   The authors would like to acknowledge the Foundation for Research
   Support of the State of São Paulo – FAPESP (grant number 2015/06686-7),
   the National Council for Scientific and Technological Development -
   CNPq (scholarship number 142373/2015-0) and the Brazilian Federal
   Agency for Support and Evaluation of Graduate Education - CAPES
   (scholarship number 88881.131671/2016-01) for support and the Institute
   of Animal Science (IZ), Sertãozinho, SP, for providing the data used in
   this study.

References