Abstract

   Background: Water buffaloes represent a crucial genetic resource for
   the global dairy industry, yet enhancements in their production
   performance remain relatively constrained. The advent of advanced
   sequencing technologies, coupled with genome-wide association studies
   (GWASs), has significantly boosted the potential for breeding
   superior-quality water buffalo. Methods: An integrated genomic analysis
   was performed on sequencing data from 100 water buffaloes, utilizing
   the high-quality UOA_WB_1 genome assembly as a reference. This study
   particularly emphasized reproduction-related traits, with a focus on
   age at first calving (AFC). Results: Our analysis revealed two
   significant single-nucleotide polymorphisms (SNPs). Based on these
   genetic markers, the GRM1 gene was identified as a candidate gene. This
   gene shows substantial involvement in various reproduction-associated
   pathways, including the FOXO signaling pathway, calcium signaling
   pathway, and estrogen signaling pathway. Conclusions: The
   identification of GRM1 as a candidate gene provides a robust
   theoretical basis for molecular breeding strategies aimed at enhancing
   fertility in water buffaloes. These findings offer critical scientific
   support for optimizing breeding programs, thereby improving overall
   production efficiency.

   Keywords: buffalo, production traits, genome-wide association study

1. Introduction

   Water buffaloes play a crucial role in both meat and milk production,
   with their milk contributing significantly to the global dairy sector.
   Specifically, buffalo milk accounts for over 15% of global milk yield,
   highlighting the essential role of this species in the dairy industry.
   This statistic underscores the importance of focusing on buffalo
   welfare, productivity, and reproductive efficiency to ensure the
   continued growth and sustainability of this vital livestock sector
   [[36]1]. Buffalo milk is distinguished by its elevated levels of fat,
   protein, and minerals when compared to cattle milk, which contributes
   to its nutritional superiority and economic value. This nutritional
   profile is a key factor in the greater prevalence of buffaloes in Asia,
   where they outnumber many other livestock species. The unique
   composition of buffalo milk not only enhances its suitability for dairy
   products but also reinforces the strategic importance of buffaloes in
   the agricultural economies of the region [[37]2]. One critical aspect
   of enhancing buffalo milk production lies in improving reproductive
   efficiency, particularly through the optimization of age at first
   calving (AFC). AFC is a crucial reproductive trait that directly
   impacts the overall productivity and sustainability of buffalo herds.
   Identifying genetic markers and pathways that regulate AFC could lead
   to significant improvements in reproduction rates [[38]3]. By
   pinpointing these genetic factors, we can develop strategies to shorten
   the AFC, thereby increasing the number of calves produced over the
   lifetime of each buffalo. This would not only enhance the reproductive
   efficiency of buffalo populations but also contribute to the overall
   economic viability of buffalo farming. Therefore, understanding and
   manipulating the genetic determinants of AFC is an essential area of
   focus for researchers, farmers, and stakeholders aiming to maximize the
   potential of the buffalo milk industry [[39]4].

   Traditional breeding methods for buffaloes face significant challenges
   due to various factors, including unstable estrus conditions and
   prolonged calving intervals. These issues complicate the breeding
   process and highlight the need for innovative approaches to enhance
   reproductive efficiency and success [[40]5]. Modern genetic
   technologies offer promising strategies to address these challenges.
   Single-nucleotide polymorphisms (SNPs) represent one of the most
   fundamental and prevalent forms of genetic variation. Whole-genome
   sequencing is a powerful technique that allows for the identification
   of genetic variations across an organism’s entire genome. This method
   not only provides valuable insights into intrinsic genomic information
   but also plays a crucial role in the study of human diseases, crop
   improvement, and livestock breeding [[41]6]. Whole-genome association
   analysis is a potent tool for investigating complex genetic traits and
   identifying candidate genes. By analyzing genetic variations and
   polymorphisms throughout the entire genome, this approach facilitates
   the detection of genomic regions and genes associated with specific
   traits of interest [[42]7]. For instance, applying whole-genome
   association analysis to buffalo breeding can help identify SNPs linked
   to key reproductive traits, such as age at first calving (AFC) [[43]8].
   Understanding and leveraging these genetic markers can lead to more
   effective breeding programs, ultimately enhancing the reproductive
   performance and overall productivity of buffalo herds. Therefore,
   integrating advanced genetic technologies into buffalo breeding
   practices is essential for overcoming current limitations and
   maximizing the economic potential of buffalo farming [[44]9].

   Globally, in previous years, several studies have been conducted,
   analyzing the correlation between SNPs and reproductive traits. In
   their study, Islam et al. [[45]10] embarked on a comprehensive
   genome-wide association analysis involving 167 buffaloes, with the
   primary objective of discovering novel candidate genes linked to traits
   such as body weight and production performance. This analysis aimed to
   shed light on the genetic underpinnings of these economically important
   characteristics. The study employed the 90K Axiom Buffalo SNP Array to
   detect SNPs within a range of 2000 base pairs upstream of the buffalo
   FABP gene [[46]11]. They identified SNP sites within a specific region
   and performed an association analysis to examine the relationship
   between these SNPs and various reproduction-related traits. Despite the
   increasing interest in the application of whole genome association
   analysis in studying reproductive-related traits in water buffaloes,
   the number of studies conducted in this area remains limited when
   compared to the extensive body of research in cattle. This highlights
   the need for further investigation in this field to deepen our
   understanding of the genetic factors underlying reproductive traits in
   water buffaloes.

   In this study, using four statistical models, linkage-disequilibrium
   iteratively nested keyway (Blink) [[47]12], fixed and random model
   circulating probability unification (FarmCPU) [[48]13], generalized
   linear models (GLM_Q) [[49]14], and settlement of mixed linear models
   under progressively exclusive relationships (SUPER) [[50]15,[51]16], we
   fully identified two high-confidence SNPs. Through variant annotation
   and QTL mapping, we detected GRM1 candidate genes that could
   potentially affect the reproductive traits. By understanding these
   factors, we aim to provide valuable insights that will guide our future
   research and contribute to the development of more effective breeding
   strategies.

2. Materials and Methods

2.1. Ethics Statement

   All finished work was conducted in accordance with national and
   international guidelines. The protocol for this study was approved by
   the Attitude of the Animal Care & Welfare Committee of the Guangxi
   Buffalo Research Institute (Approval Code: GXU2019-021).

2.2. Phenotypes and Animal Resources

   This study utilized data collected from 100 water buffaloes at the
   Guangxi Buffalo Research Institute in China from 2000 to 2021. The
   sample comprised the following: 1 Guangxi local water buffalo (DB), 44
   hybrid water buffaloes (ZB), 23 Nili-Ravi water buffaloes (NB), and 32
   Murrah water buffaloes (MB). The Murrah water buffaloes are purebred
   animals originally imported from India, while the Nili-Ravi water
   buffaloes are purebred animals initially introduced from Pakistan. The
   hybrid water buffaloes included in this study were produced by crossing
   Nili-Ravi and Murrah breeds. The target trait of this study is age at
   first calving (AFC).

2.3. Sample Collection and Sequencing

   Blood samples from water buffaloes were obtained through tail vein
   puncture by utilizing a vacuum blood collector. Genomic DNA was
   extracted from the blood using the phenol/chloroform method, and its
   integrity and yield were evaluated via agarose gel electrophoresis. The
   DNA libraries were sequenced on the Illumina sequencing platform
   (Illumina HiSeqTM 2000) by Genedenovo Biotechnology Co., Ltd.
   (Guangzhou, China).

2.4. Alignments and Variant Identification

   The clean reads were aligned to the reference genome (UOA_WB_1) using
   BWA-MEM (v0.7.17) with default settings [[52]10]. Then, Samtools
   (v1.9), Picard tools (v3.1.1), and GATK (v4.0) were used for SNP
   detection [[53]17,[54]18]. All detected SNPs underwent filtering
   through the “Variant Filtration” module of GATK, using the following
   standard parameters: variants with Quality Depth (QD) < 2; FS
   (Phred-scaled p-value using Fisher’s exact test for strand bias
   detection) > 60; MQRankSum (Z-score of the rank sum of the Phred-scaled
   mapping qualities) < −12.5; ReadPosRankSum (Z-score of the rank sum of
   the Phred-scaled position bias estimations) < −8; MQ (root mean square
   of the mapping quality) < 40.0; the mean sequencing depth of variants
   (across all individuals) was limited to less than 1/3× and more than
   3×; SOR (strand odds ratio) > 3.0; the maximum missing rate was less
   than 0.1; and SNPs were limited to two alleles.

2.5. Variation Filtering

   The presence of rare alleles (alleles with low frequency within the
   population), high rates of missing data, and substantial heterozygosity
   at specific loci can introduce anomalies in population analysis and
   whole-genome association studies. Therefore, we aligned the processed
   reads to the reference genome (UOA_WB_1). Subsequently, we employed
   PLINK (v1.9) software to filter the detected loci based on standard
   criteria [[55]19]. The filtering process involved stringent adherence
   to several criteria: exclusion of non-biallelic SNPs, removal of those
   with a minor allele frequency below 0.05, discarding SNPs with a
   missing genotype rate exceeding 30%, and further limiting the analysis
   to SNPs with a heterozygosity ratio below the threshold of 0.8. This
   was all executed using robust PLINK (v1.9) software.

2.6. Principal Component Analysis

   GCTA (v1.92.2) is a robust tool for the analysis of whole-genome
   complex traits [[56]20]. In this study, we utilized GCTA (v1.92.2) and
   PLINK (v1.9) software to perform PCA (Principal Component Analysis)
   using the selected SNP markers. This analysis enabled us to derive the
   variance accounted for by each PC (principal component) and the score
   matrix representing the samples’ positions within each PC.

2.7. Population Structure Analysis

   Population structure analysis offers valuable insights into the
   ancestry and composition of individuals, rendering it an exceptionally
   effective approach for elucidating genetic relationships. To validate
   the outcomes of the PCA, we performed a population structure analysis.
   Model-based population structure inference methods typically assume
   that the markers utilized for analysis are independent of one another.
   Consequently, prior to initiating the analysis, it is essential to
   execute marker independence filtering, which is based on the assessment
   of linkage disequilibrium between markers. In this analysis, we
   utilized PLINK (v1.9) and Admixture software (v1.3) [[57]21] to perform
   marker filtering for population structure analysis. In our analysis, we
   implemented a 100 kb step size and a 10 nucleotide (nt) window size,
   and we removed one marker from each pair of markers with an r^2 value
   greater than 0.2. Specifically, we removed the marker with the higher
   physical position from each pair of markers with a high degree of
   linkage disequilibrium. As a result of implementing the aforementioned
   filtering strategy, we retained a total of 272,462 markers for the
   population structure analysis. Utilizing the filtered SNP markers, we
   conducted a principal component analysis (PCA) using PLINK to
   investigate the population structure and clustering patterns. The PCA
   results were visualized to illustrate the relationships among the first
   three principal components, providing insights into the genetic
   diversity and relatedness within the population. Additionally, we
   employed Admixture software (v1.3) to perform an in-depth analysis of
   population structure, estimating the proportion of ancestry from K
   ancestral populations and identifying subpopulations within the
   dataset. In our analysis, we explored the cross-validation (CV) error
   for various k-values using Admixture software (v1.3), ranging from 2 to
   9. We utilized PopHelper software (v2.2.7) [[58]22] to generate bar
   plots illustrating the genetic composition of each sample within every
   subgroup. By systematically testing these k-value hypotheses, we aimed
   to identify the optimal number of clusters that would provide the most
   meaningful and informative partitioning of the studied populations.

2.8. Genome-Wide Association Mapping

   A genetic relationship matrix was constructed based on the genotype
   using TASSEL (version 5.2.86) [[59]23], and principal component
   analysis (PCA) was implemented with PLINK software (version 1.9.0). We
   used GAPIT (version 3.0) [[60]16] to identify candidate SNPs associated
   with AFC traits, which is widely used to identify QTLs [[61]24], and
   analyze the association between the molecular markers and traits of the
   mixed population based on 4 models: GLM (Q) [[62]14], SUPER [[63]15],
   FarmCPU [[64]13], and Blink [[65]12]. To correct the influence of
   population structure and reduce the false positive rate, the kinship
   matrix and the first 3 PCA scores were fitted as random and fixed
   effects, respectively [[66]25]. Manhattan and quantile-quantile (Q-Q)
   plots were drawn with the CMplot package [[67]26] in R software
   (version 4.2.2), and Venn diagrams were plotted using the ggvenn
   package ([68]https://CRAN.R-project.org/package=ggvenn, accessed on 1
   October 2024) in R software (version 4.2.2).

   Generalized Linear Models are a widely employed and versatile
   statistical method for data analysis [[69]14]. In the present study, we
   utilized Generalized Linear Models for conducting a genome-wide
   association analysis.

   The model of GLM (Q) [[70]14] is as follows:
   y = Xα + Zβ + Wμ + e

   where y is the phenotypic trait to be studied; α is the fixed effect,
   which is another factor affecting the phenotype and generally refers to
   population structure; β is the marker effect, which refers to SNPs; μ
   is the random effect, X, Z, and W are the incidence matrices for α, β,
   and μ, respectively; and e is the residual of the model.

   The model of SUPER [[71]15] is as follows:
   y = Xα + Zβ + Wμ + e (1)
   L(y|β, σa2, σe2, s, b) (2)
   L(y|α, β, σa2, σe2, s^, b^) (3)

   where Equation (1) has the same parameters as the GLM (Q) model, and
   Equation (2) is the solution of all unknown parameters under which
   Equation (1) determines the observations y with maximum likelihood,
   where σa2 and σe2 are the genetic variance and residual variance, s and
   b are the number and size of bins, and s^ and b^ are the estimates to
   maximize the number and size of bins, respectively.

   The model of FarmCPU [[72]13] is as follows:
   yi = Si1 b1 + Si2 b2 + … + Sit bt + Hij dj + ej (4)
   yi + ui + ei (5)

   where yi is the observed value of the ith individual; Si1, Si2,…, Sit
   are the genotypes of the pseudo quantitative trait nucleotides (QTNs);
   b1, b2,…, bt are the corresponding effects of the pseudo QTNs; Hij is
   the jth genetic marker by the ith individual; dj is the effect value
   for the marker; and ei is the residuals of the model. yi and ui in
   Equation (5) are the same as in Equation (4), and ui is the total
   genetic effect of the ith individual.

   The model of Blink [[73]12] is as follows:
   yi = Si1 b1 + Si2 b2 + … + Sit bt + Sij dj + ej (6)
   yi = Si1 b1 + Si2 b2 + … + Sit bt + ei (7)
   BIC = –2LL + 2tLn(n) (8)

   where LL is the log likelihood, t is the number of pseudo QTNs, Ln is
   the natural log, n is the number of individuals, and the symbolic
   meanings of Equations (6) and (7) are the same as those of Equation (6)
   in FarmCPU.

   The outcomes are presented through Manhattan plots and Q-Q plots. SNPs
   exhibiting p-values below the specified threshold of 0.05/N (number of
   SNP) are identified as highly significant SNPs. When a reference genome
   is available, candidate genes are determined by including those genes
   that are physically positioned within a 50 kb genomic region
   surrounding the significant SNPs. DbSNP [[74]27] is a database
   specifically designed by the NCBI to store genetic variation
   information. We used dbSNP to determine whether the SNPs we identified
   are located in the coding regions of genes.

2.9. Pathway Enrichment and Protein–Protein Interaction

   Genes often work in concert to perform specific biological functions.
   Pathway-based analysis is a valuable approach for understanding the
   roles of genes in these complex processes. The KEGG (Kyoto Encyclopedia
   of Genes and Genomes) database [[75]28] stands as one of the foremost
   publicly accessible resources for pathway-related data. To identify
   significantly enriched metabolic and signal transduction pathways among
   CAGs (Candidate Associated Genes) relative to the entire genome
   context, pathway enrichment analysis was performed. The method for
   calculating enrichment is consistent with that employed in Gene
   Ontology (GO) [[76]20] analysis:
   [MATH: <mrow><mrow><mi>P</mi><mo>=</mo><mn>1</mn><mo>−</mo><mstyle
   displaystyle="true"><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn
   >0</mn></mrow><mrow><mi>m</mi><mo>−</mo><mn>1</mn></mrow></munderover><
   /mstyle><mfrac><mrow><mo
   stretchy="false">(</mo><mtable><mtr><mtd><mi>M</mi></mtd></mtr><mtr><mt
   d><mi>i</mi></mtd></mtr></mtable><mo stretchy="false">)</mo><mo
   stretchy="false">(</mo><mtable><mtr><mtd><mi>N</mi><mo>−</mo><mi>M</mi>
   </mtd></mtr><mtr><mtd><mi>n</mi><mo>−</mo><mi>i</mi></mtd></mtr></mtabl
   e><mo stretchy="false">)</mo></mrow><mrow><mo
   stretchy="false">(</mo><mtable><mtr><mtd><mi>N</mi></mtd></mtr><mtr><mt
   d><mi>n</mi></mtd></mtr></mtable><mo
   stretchy="false">)</mo></mrow></mfrac></mrow></mrow> :MATH]

   In this context, N signifies the total count of genes with KEGG
   annotations, while n denotes the number of CAGs within N. M represents
   the total number of genes annotated to particular pathways, and m is
   the number of CAGs in M. Following the calculation of the p-value, it
   was corrected using False Discovery Rate (FDR) adjustment, with an FDR
   value of 0.05 or less being set as the threshold. Pathways that met
   this criterion were categorized as significantly enriched pathways in
   CAGs. Finally, we utilized the String database to identify genes that
   are significantly represented in pathways and created a protein–protein
   interaction map. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and
   Genomes (KEGG) analysis were performed using OmicShare tools, a free
   online platform for data analysis ([77]http://www.omicshare.com/tools,
   accessed on 1 October 2024).

2.10. Statistical Analysis

   Statistical analyses were performed using the SPSS 18.0 software
   package (SPSS Science, Chicago, IL, USA). Experimental data were
   subjected to t-test and ANOVA analyses, with a significance threshold
   set at p < 0.05. Graphs were generated using GraphPad Prism 8 software
   (GraphPad, Santiago, USA). Data are presented as the mean ± standard
   deviation (SD).

3. Results

3.1. Phenotypic Value Statistics of the Traits

   During the phenotypic evaluation of the buffaloes, we analyzed the age
   at first calving (AFC) as a key production trait. Specifically, for the
   AFC phenotypic value analysis: The mean value of AFC was recorded as
   3.67 years; detailed phenotypic data statistics are presented in
   [78]Figure 1 and [79]Table 1. Our results, supported by relevant
   literature, clearly indicate that AFC is a critical factor influencing
   reproductive efficiency. Previous studies have identified 4 years as a
   significant threshold, suggesting that buffaloes with an AFC greater
   than 4 years tend to exhibit lower reproductive performance. This
   comprehensive analysis provides insights into the reproductive
   performance of buffaloes, with particular emphasis on the AFC trait.

Figure 1.

   [80]Figure 1
   [81]Open in a new tab

   AFC histogram. AFC, age at first calving (year).

Table 1.

   Statistical description of fertility traits *.
   Traits Mean  SD  Min  Max
    AFC   3.67 0.88 2.17 6.54
   [82]Open in a new tab

   * SD, standard deviation. AFC, age at first calving.

   In this study, a comprehensive analysis of genome-wide variations led
   to the identification of 2,208,174 genetic markers. Among the detected
   genetic markers, 2,012,270 were identified as SNPs and 195,904 were
   classified as insertion–deletion (Indel) variants. Following stringent
   filtering criteria, a refined set of 272,462 markers was retained,
   comprising 255,453 SNPs and 17,009 Indels.

3.2. Population Structure

   Upon obtaining PCA scores, the samples under investigation can be
   visualized via a scatter plot that utilizes the values of the first
   three principal components as axes. Referring to [83]Figure 2, in
   scatter plots [84]Figure 2A, it is evident that the majority of
   individuals within herds MB and NB are distinctly isolated from one
   another. In scatter plot [85]Figure 2B, we can observe clustering,
   particularly in groups MB and ZB, where these two clusters overlap and
   are closely grouped together in multiple instances. In scatter
   [86]Figure 2A,C, it is evident that the herds are broadly segregated
   into three distinct clusters. One cluster predominantly consists of NB
   herds, whereas the other two clusters are primarily composed of MB, ZB,
   and DB herds, respectively.

Figure 2.

   [87]Figure 2
   [88]Open in a new tab

   Graphs displaying the sample clustering obtained from PCA through three
   two-dimensional scatter plots, namely scatter (A), scatter (B), scatter
   (C), and scree plots (D). The percentage of variance explained by each
   PC is noted in parentheses. In the scatter plots, colored circles
   represent four different groups: DB, MB, NB, and ZB correspond to 1 DB,
   32 MB, 23 NB, and 44 ZB, respectively.

   To establish the optimal number of clusters (k), we used Admixture
   software and evaluated the cross-validation error rates. The Admixture
   algorithm performs model-based clustering and estimates the proportion
   of ancestry from K ancestral populations. By minimizing the
   cross-validation error rates, we identified the value of k = 3 that
   best fits our data. [89]Figure 3 displays the line graph depicting the
   cross-validation error rate.

Figure 3.

   [90]Figure 3
   [91]Open in a new tab

   The line graph illustrating the cross-validation error rate is
   depicted, with the number of sample clusters delineated along the
   x-axis and the corresponding cross-validation error rate indicated on
   the y-axis.

   To simulate the population classification and genetic ancestry of each
   sample across varying numbers of subgroups (K = 2–9), we utilized
   PopHelper software (v2.2.7) [[92]22] to generate bar plots illustrating
   the genetic composition of each sample within every subgroup. The
   results are presented in [93]Figure 4, where each color corresponds to
   a distinct cluster for each K-value. From the line graph of the
   cross-validation error rate, it is evident that the optimal number of
   clusters is K = 3. Similarly, as observed in the bar graph ([94]Figure
   3) depicting the genetic composition of the samples, when K = 3, it is
   the optimal number of clusters for these 100 buffaloes. This finding is
   consistent with the results obtained from the PCA analysis.
   Consequently, we conclude that these 100 buffaloes can be effectively
   divided into three distinct subgroups.

Figure 4.

   [95]Figure 4
   [96]Open in a new tab

   Genetic bar chart illustration for K-means clustering with varying
   numbers of clusters (K = 2 to 9).

3.3. Results of the Genome-Wide Associations

   Following the calculation of p-values for the SNP loci using a
   generalized linear model, we constructed a Manhattan plot and a Q-Q
   plot, as presented in [97]Figure 5.

Figure 5.

   [98]Figure 5
   [99]Open in a new tab

   Association analysis of age at first calving (AFC) traits in water
   buffaloes using different GWAS methods. (A) GLM-Q, (B) BLINK, (C)
   SUPER, (D) and FarmCPU. (E) Venn plot of significant SNPs identified
   using all four methods. The Manhattan plot on the left, created using
   the qqman package, illustrates the p-values for SNP markers across 25
   chromosomes (comprising 24 autosomes and one X chromosome). The blue
   line delineating the Manhattan plot signifies the significance
   threshold, determined by 0.05/N (number of SNPs). Markers that surpass
   this threshold are deemed significant. The plot on the right is a Q-Q
   plot, where the x-axis denotes the observed values of the markers, and
   the y-axis represents the expected values, which have been transformed
   into the −10 log scale.

   The leftmost plot is a Manhattan plot, where the visually discernible
   blue line, running parallel to the x-axis, serves as the critical
   demarcation line. In the Manhattan plot, the points that rise above the
   threshold line, which is the blue line paralleling the x-axis, signify
   significant loci. Upon identifying the significant loci that surpass
   the threshold line in the Manhattan plot, the subsequent step involves
   documenting the relevant information pertaining to these significant
   loci.

   The plot on the right is the Q-Q plot, where the points in the bottom
   left corner fall along the line, suggesting that the observed p-values
   align closely with the expected values. The points exhibit a distinct
   upward deviation from the diagonal in the upper right corner, which
   signifies that the observed p-values exceed the anticipated values. The
   presence of these points, which denote significant loci, across all
   four plots underscores the validity of the analytical model, suggesting
   its appropriateness in capturing the underlying patterns. Following
   this, the genes situated within a 50 kb range of the significant loci
   are carefully selected and earmarked as candidate genes.

   In our GWAS study, we analyzed significant loci using four different
   GWAS methods (GLM-Q, BLINK, SUPER, and FarmCPU) ([100]Figure 5A–D). Our
   analysis revealed that SUPER and GLM(Q) identified 34 significant SNPs
   ([101]Figure 5E), which were annotated to 30 genes (Detailed pathway
   information for these candidate genes: [102]Supplementary Table S1). To
   ensure the robustness and reliability of our findings, we focused on
   the significant loci identified using all four methods (Blink, SUPER,
   GLM(Q), and FarmCPU). This stringent approach enabled us to pinpoint
   two statistically significant SNPs and identify one candidate gene
   ([103]Table 2) within a 50 Kb range surrounding these loci,
   specifically associated with the trait age at first calving (AFC).

Table 2.

   The SNPs identified via genome-wide association analysis encompass
   detailed information regarding their chromosomal locations, p-values,
   and associated candidate genes.
   Methods SNP Chr Pos p R2 Candidate Genes
   SUPER/GLM(Q)/Blink/FarmCPU 1 [104]NC_037554.1 20,660,384 5.0621 × 10^−8
   0.49381 GRM1
   SUPER/GLM(Q)/Blink/FarmCPU 2 [105]NC_037556.1 50,669,172 6.5734 × 10^−8
   0.42667 --
   SUPER/GLM(Q) 3 [106]NC_037545.1 35,877,289 6.8853 × 10^−8 0.34081 --
   SUPER/GLM(Q) 4 [107]NC_037546.1 31,601,350 3.7412 × 10^−9 0.41127 OPN5
   SUPER/GLM(Q) 5 [108]NC_037548.1 4,405,934 6.4571 × 10^−8 0.33847
   FAM118A; UPK3A; KIAA0930
   SUPER/GLM(Q) 6 [109]NC_037548.1 157,620,852 5.3121 × 10^−8 0.3642 --
   SUPER/GLM(Q) 7 [110]NC_037551.1 20,700,724 8.906 × 10^−9 0.3905 --
   SUPER/GLM(Q) 8 [111]NC_037551.1 52,947,579 1.05 × 10^−7 0.35825 --
   SUPER/GLM(Q) 9 [112]NC_037552.1 24,460 6.7407 × 10^−8 0.29687 --
   SUPER/GLM(Q) 10 [113]NC_037552.1 112,491,377 3.2491 × 10^−8 0.32257
   ZNF777; ZNF746
   SUPER/GLM(Q) 11 [114]NC_037552.1 113,607,118 2.4523 × 10^−8 0.33352
   ABCB8; ASIC3; CDK5; SLC4A2;
   FASTK; TMUB1; AGAP3
   SUPER/GLM(Q) 12 [115]NC_037553.1 15,992,633 2.4361 × 10^−8 0.36112
   FAM81B
   SUPER/GLM(Q) 13 [116]NC_037554.1 9,287,335 9.7346 × 10^−8 0.31436
   TULP4; GTF2H5; SERAC1
   SUPER/GLM(Q) 14 [117]NC_037554.1 16,312,432 1.3425 × 10^−9 0.3879 --
   SUPER/GLM(Q) 15 [118]NC_037554.1 99,978,913 1.0129 × 10^−7 0.33706 --
   SUPER/GLM(Q) 16 [119]NC_037555.1 32,169,042 1.5181 × 10^−7 0.33459 --
   SUPER/GLM(Q) 17 [120]NC_037556.1 75,679,149 8.5899 × 10^−8 0.36278 --
   SUPER/GLM(Q) 18 [121]NC_037556.1 80,488,833 1.413 × 10^−8 0.39141
   KCNS3; MSGN1
   SUPER/GLM(Q) 19 [122]NC_037556.1 81,118,534 7.4243 × 10^−9 0.36956 --
   SUPER/GLM(Q) 20 [123]NC_037558.1 8,094,566 8.613 × 10^−8 0.37056 YBX1;
   SLC2A10
   SUPER/GLM(Q) 21 [124]NC_037559.1 57,471,045 5.8845 × 10^−8 0.32884 --
   SUPER/GLM(Q) 22 [125]NC_037559.1 74,176,040 3.5625 × 10^−8 0.35173
   WISP1; NDRG1
   SUPER/GLM(Q) 23 [126]NC_037561.1 25,915,631 2.8368 × 10^−9 0.40653 --
   SUPER/GLM(Q) 24 [127]NC_037561.1 25,915,784 1.0496 × 10^−9 0.40948 --
   SUPER/GLM(Q) 25 [128]NC_037562.1 28,221,520 6.038 × 10^−8 0.34565 --
   SUPER/GLM(Q) 26 [129]NC_037562.1 28,221,526 6.038 × 10^−8 0.34565 --
   SUPER/GLM(Q) 27 [130]NC_037562.1 28,221,527 6.038 × 10^−8 0.34565 --
   SUPER/GLM(Q) 28 [131]NC_037562.1 39,920,231 2.277 × 10^−9 0.40894 HYDIN
   SUPER/GLM(Q) 29 [132]NC_037563.1 45,012,914 1.8896 × 10^−8 0.36036 --
   SUPER/GLM(Q) 30 [133]NC_037564.1 36,725,181 9.3193 × 10^−8 0.31929
   LINGO1
   SUPER/GLM(Q) 31 [134]NC_037566.1 36,784,161 4.9686 × 10^−8 0.3255
   B4GALT6
   SUPER/GLM(Q) 32 [135]NC_037568.1 13,951,865 6.9031 × 10^−8 0.33258
   CALN1
   SUPER/GLM(Q) 33 [136]NC_037569.1 5,832,499 2.107 × 10^−8 0.35009 KAL1
   SUPER/GLM(Q) 34 [137]NC_037569.1 38,961,581 1.0446 × 10^−7 0.29241 BCOR
   [138]Open in a new tab

3.4. Kyoto Encyclopedia of Genes and Genomes Pathway Analysis of Candidate
Genes

   The functional enrichment cycle diagram displayed the top 20 KEGG
   pathways ([139]Figure 6, [140]Supplementary Table S2), classified by
   –log10 (p-value), which revealed that the candidate genes were mainly
   enriched in five KEGG_A_class pathways, including environmental
   information processing, cellular processes, and organismal systems.
   Among these pathways, environmental information processing included the
   FOXO signaling pathway (ko04068), the phospholipase D signaling pathway
   (ko04072), the calcium signaling pathway (ko04020), and neuroactive
   ligand–receptor interaction (ko04080); cellular processes included gap
   junction (ko04540). In conclusion, our data suggest that the identified
   candidate genes play a crucial role in regulating fertility
   performance, particularly in terms of age at first calving (AFC), by
   modulating the FOXO signaling pathway, phospholipase D signaling
   pathway, and calcium signaling pathway. These findings provide valuable
   insights into the molecular mechanisms underlying fertility and could
   inform future breeding strategies to enhance reproductive performance
   in water buffaloes.

Figure 6.

   [141]Figure 6
   [142]Open in a new tab

   KEGG analysis of candidate genes. The enrichment circle diagram shows
   the KEGG analysis of the top 20 pathways. Four circles from the outside
   to the inside. First circle: The classification of enrichment; outside
   the circle is the scale of the number of genes. Different colors
   represent different categories. Second circle: The number and p-values
   of the classification in the background genes. The more genes, the
   longer the bars, the smaller the value, and the redder the color. Third
   circle: Bar chart of the total number of candidate genes. Fourth
   circle: rich factor value of each classification (the number of
   candidate genes in this classification divided by the number of
   background genes). Each cell of the background helper line represents
   0.1, and the color coding signifies the statistical significance of the
   corresponding enrichment.

3.5. Significant Association of Fertility with SNP Validation

   [143]Table 3 presents the results of individual genotyping for a single
   key locus in water buffaloes, [144]NC_037554.1:20660384 (GRM1).
   Combining these findings with the critical reproductive efficiency
   metric, calving interval (unpublished data), the analysis reveals
   significant differences in calving interval across different genotypes
   at this locus.

Table 3.

   The results of individual genotyping.
   Candidate Gene SNP (Chr:Pos) Calving Interval (Year)
   Homozygous Mutation Heterozygous Mutation Reference Genotype
   GRM1 [145]NC_037554.1:20660384 A/A G/A G/G
   1.02 ± 0.15 B 1.21 ± 0.23 B 1.52 ± 0.28 A
   [146]Open in a new tab

   The phenotypic values of calving interval (year) are expressed as
   “least squares mean ± standard deviation”. Different letters in the
   same column indicate significant differences (p < 0.05); the same
   letter or no letter indicates no significant difference (p > 0.05). The
   genotypes of SNP loci are arranged in the order of homozygous mutant,
   heterozygous, and reference types.

   For GRM1 ([147]NC_037554.1:20660384): The A/A and G/A genotype is
   associated with a significantly lower calving interval (year) compared
   to the G/G genotypes.

   These findings highlight the influence of specific genotypes on
   Fertility, providing valuable insights for genetic selection and
   breeding programs aimed at improving fertility in water buffaloes.

4. Discussion

4.1. Population Stratification

   Population stratification is a critical factor in GWAS studies, as it
   can introduce false associations due to differences in ancestral
   origins, leading to discrepancies in allele frequencies
   [[148]29,[149]30]. To mitigate this risk, we must account for
   population stratification in our dataset. Principal Component Analysis
   (PCA) helps reduce data complexity while preserving the covariance
   structure [[150]31]. We used PLINK to calculate the proportion of
   variance explained by the first 10 principal components with filtered
   SNP markers, assessing how well these components capture genetic
   variation.

   Additionally, we performed K-cluster analysis using Admixture software
   and evaluated cross-validation error (CV-error) to determine the
   optimal number of subpopulations. By minimizing CV-error, we identified
   the most appropriate value of K for population stratification. Despite
   being raised on the same farm, the Nili-Ravi buffaloes originate from
   Pakistan, while the Murrah buffaloes come from India. Our PCA plot
   shows clustering between groups MB and ZB, indicating a closer genetic
   relationship between these two groups compared to the others. Overall,
   the data suggest two main segments, but the CV-error plot indicates
   that the optimal number of subpopulations (K) is three. Therefore, we
   conclude that these 100 water buffaloes should be divided into three
   subgroups.

4.2. Genome-Wide Association Analysis of Reproductive-Related Traits

   Enhancing the reproductive rate of water buffalo has become a critical
   research priority within the dairy industry. Water buffaloes are
   crucial for global dairy production, especially in regions where they
   are the main milk source. However, there is a significant research gap
   in understanding their fertility traits. Unlike Holstein cows, which
   have been extensively studied, water buffaloes have received much less
   attention. This disparity means that key aspects of water buffalo
   genetics, particularly those influencing reproductive performance,
   remain underexplored. Current breeding programs for water buffaloes lag
   behind those for other dairy species due to this lack of genetic
   insight. Identifying candidate genes and pathways that affect
   reproductive traits—such as calving interval, age at first calving, and
   conception rates—could greatly enhance breeding strategies. Improving
   these traits would increase the number of calves produced and boost
   overall herd productivity. Addressing this research gap is essential
   for developing targeted breeding strategies that focus on enhancing
   reproductive performance, thereby unlocking the full potential of water
   buffalo dairy operations.

   In the association analysis of AFC, we identified a total of one
   candidate gene (GRM1). Fertility is a highly complex process involving
   the coordinated action of multiple signaling pathways. Among the
   identified genes, several are associated with key pathways, including
   the FOXO signaling pathway, calcium signaling pathway, and estrogen
   signaling pathway.

   Previous studies have indicated that the FOXO signaling pathway plays a
   crucial role in regulating cellular growth, development, cell cycle
   progression, and other functions, all of which contribute to
   reproductive efficiency [[151]32,[152]33,[153]34,[154]35]. We have
   identified the candidate gene GRM1 as playing a significant biological
   role within the FOXO signaling pathway. Specifically, GRM1
   (Metabotropic Glutamate Receptor 1) is involved in modulating various
   intracellular signaling cascades, including those related to oxidative
   stress response and apoptosis, both of which are critical for
   maintaining optimal reproductive performance [[155]36,[156]37]. For
   example, GRM1 can interact with key components of the FOXO signaling
   pathway, such as AKT and FOXO transcription factors, to regulate
   cellular processes [[157]38]. This interaction helps maintain cellular
   homeostasis and promotes efficient reproductive functions. By
   modulating the activity of AKT, GRM1 can influence the phosphorylation
   status of FOXO proteins, thereby controlling their nuclear localization
   and transcriptional activity. This regulatory mechanism ensures that
   cells respond appropriately to environmental cues and maintain proper
   reproductive function.

4.3. The Mechanism of SNP Mutation and Fertility Traits

   Among the selected SNPs, none were found to be non-synonymous. However,
   we identified base mutations in non-coding regions, specifically at
   position [158]NC_037554.1:20660384 within the GRM1 gene, which may
   potentially influence fertility-related traits. These findings suggest
   that variations in non-coding regions could play a crucial role in
   regulating important production and quality traits. Variability in
   non-transcriptional regulatory sequences, such as promoters, enhancers,
   and CTCF binding sites, is closely linked to the mechanisms underlying
   non-coding variation in cell development. However, whether these
   principles apply specifically to fertility traits requires further
   investigation [[159]39,[160]40].

4.4. Discussion of Tradeoffs

   In dairy cows, extensive research has demonstrated that selection for
   increased milk production often comes at the expense of reproductive
   performance. This phenomenon can be attributed to several factors,
   including the following: Energy Allocation: High-producing cows
   allocate a significant portion of their energy resources to milk
   synthesis, leaving less available for other physiological functions,
   such as reproduction. This can result in delayed first calving (AFC),
   reduced conception rates, and prolonged intervals between calvings
   [[161]41,[162]42,[163]43,[164]44,[165]45]. Nutritional Stress: Cows
   producing higher volumes of milk require more energy-dense diets to
   meet their nutritional needs. In environments where feed quality or
   availability is limited, breeding for lower milk production may be
   advantageous, as it allows for better allocation of energy to
   reproductive functions, thereby improving overall reproductive
   efficiency [[166]46,[167]47,[168]48]. Physiological Strain: The
   metabolic demands of high milk production can induce stress on the
   cow’s body, potentially leading to health issues such as ketosis,
   mastitis, and lameness, all of which can negatively impact reproductive
   performance [[169]49,[170]50,[171]51,[172]52].

   In areas where buffaloes are subjected to lower-quality feeds or
   seasonal variations in resource availability, breeding strategies
   should consider balancing milk production with reproductive efficiency.
   Selecting for moderate milk yields could enhance reproductive
   performance, ensuring sustainable herd management. Future studies
   should explore the genetic basis of these tradeoffs in buffaloes.
   Understanding the specific genes and pathways involved in both milk
   production and reproduction can help develop targeted breeding programs
   that optimize both traits. This makes it possible to develop integrated
   breeding programs that aim to balance milk production with reproductive
   performance, taking into account environmental constraints and feed
   availability.

4.5. Limitations and Future Directions

   While our study provides valuable insights into the reproductive
   performance of buffaloes, particularly focusing on the age at first
   calving (AFC), there are several limitations that need to be
   acknowledged. Addressing these limitations will help guide future
   research and improve the robustness of our findings. Limitations:
   Sample Size: Our pilot study included a limited number of buffaloes,
   which may have affected the generalizability of our results. A larger
   sample size would allow for more comprehensive statistical analyses and
   potentially uncover additional factors influencing AFC and reproductive
   efficiency [[173]3,[174]53]; Parity (Litter Number): The study did not
   account for variations in parity, which could influence AFC and
   subsequent reproductive performance. Including data on multiple
   parities in future studies would provide a more nuanced understanding
   of how parity affects reproductive traits over time [[175]54,[176]55];
   Calving Season: The timing of calving can significantly affect various
   reproductive parameters, including AFC. Our study did not control for
   or analyze the effects of different calving seasons. Future research
   should consider the seasonal variations in environmental conditions and
   management practices, as these factors can influence reproductive
   performance [[177]56,[178]57].

   To build on this pilot study, we propose the following next steps:
   Expand the Sample Size: Conducting a larger-scale study with a broader
   geographic distribution and increased sample size will enhance the
   reliability and external validity of our findings [[179]58];
   Comprehensive Reproductive Performance Analysis: Future work should
   incorporate a holistic approach to reproductive performance by
   integrating data on AFC, first lactation performance, parity, and
   calving season. This integrated analysis will provide a more
   comprehensive understanding of the factors influencing reproductive
   efficiency; and Longitudinal Studies: Implement longitudinal studies to
   track changes in reproductive performance over multiple calving cycles
   and across different seasons. Such studies will offer deeper insights
   into the long-term impacts of AFC and other reproductive traits.

   By addressing these limitations and pursuing these future directions,
   we aim to refine our understanding of the key determinants of
   reproductive performance in buffaloes and contribute to the development
   of more effective breeding and management strategies.

5. Conclusions

   In this investigation, a genome-wide association analysis was conducted
   to identify genetic factors associated with reproduction-related traits
   in domestic water buffaloes, with a particular focus on age at first
   calving (AFC). Our study identified a total of two significant SNP
   loci, and one candidate gene within a 50 Kb range surrounding these
   loci was selected for further investigation. The candidate gene was
   enriched in biological processes such as FOXO signaling pathway (GRM1),
   calcium signaling pathway (GRM1), and estrogen signaling pathway
   (GRM1), all of which are directly or indirectly involved in the
   reproduction process. These findings offer a reference framework for
   comprehending the genetic architecture underlying production and
   quality attributes in water buffaloes, thereby paving the way for
   subsequent biological validation of the implicated genes. This is
   crucial guidance for breeding and improvement programs in water
   buffaloes. In future studies, we will focus on elucidating the
   relationships between candidate genes and metabolites involved in the
   production process. Additionally, longitudinal studies and functional
   validations are needed to confirm the roles of these genes in different
   environmental and management contexts. The aim is to uncover the
   metabolic pathways and mechanisms through which the genes influence
   production and composition.

Acknowledgments