Abstract

   Background: Buffaloes are considered an indispensable genetic resource
   for dairy production. However, improvements in lactation performance
   have been relatively limited. Advances in sequencing technology,
   combined with genome-wide association studies, have facilitated the
   breeding of high-quality buffalo. Methods: We conducted an integrated
   analysis of genomic sequencing data from 120 water buffalo, the
   high-quality water buffalo genome assembly designated as UOA_WB_1, and
   milk production traits, including 305-day milk yield (MY), peak milk
   yield (PM), total protein yield (PY), protein percentage (PP), fat
   percentage (FP), and total milk fat yield (FY). Results: The results
   identified 56 significant SNPs, and based on these markers, 54
   candidate genes were selected. These candidate genes were significantly
   enriched in lactation-related pathways, such as the cAMP signaling
   pathway (ABCC4), TGF-β signaling pathway (LEFTY2), Wnt signaling
   pathway (CAMK2D), and metabolic pathways (DGAT1). Conclusions: These
   candidate genes (e.g., ABCC4, LEFTY2, CAMK2D, DGAT1) provide a
   substantial theoretical foundation for molecular breeding to enhance
   milk production in buffaloes.

   Keywords: buffalo, milk production traits, genome-wide association
   study

1. Introduction

   Water buffaloes hold immense significance in the dual domains of meat
   and milk production, with their milk making a considerable contribution
   to the global dairy sector. Specifically, buffalo milk accounts for an
   impressive share of over 15% of the global milk yield, underscoring the
   essential role of this species in the global dairy industry. This
   statistic highlights the importance of focusing on the welfare,
   productivity, and reproductive efficiency of buffaloes to ensure the
   continued growth and sustainability of this critical livestock sector
   [[40]1]. Buffalo milk is distinguished by its elevated levels of fat,
   protein, and minerals when compared to cattle milk, which contributes
   to its nutritional superiority and economic value. This nutritional
   profile is a key factor in the greater prevalence of buffaloes in Asia,
   where they outnumber many other livestock species. The unique
   composition of buffalo milk not only enhances its suitability for dairy
   products but also reinforces the strategic importance of buffaloes in
   the agricultural economies of the region [[41]2]. Furthermore, buffalo
   milk stands out in the market with economic significance, often
   fetching a price that is double that of bovine milk, reflecting its
   exceptional value. This distinct pricing advantage underscores the
   commercial viability of buffalo milk and its pivotal role in supporting
   the livelihoods and regional economies, distinguishing it from
   conventional milk sources. The premium price tag underscores the
   premium nutritional content and the potential for higher-quality dairy
   products derived from buffalo milk [[42]3]. However, when it comes to
   average milk production, even the most productive buffalo breeds yield
   significantly less than Holstein cows, as evidenced by tests conducted
   on milk yields. This discrepancy highlights the substantial difference
   in milk-producing capabilities between the two species [[43]4]. Indeed,
   increasing buffalo milk production while simultaneously improving its
   quality is a vital goal for fully realizing the economic potential of
   the buffalo milk industry. This objective, if achieved, would bring
   about significant benefits for the industry, making it an essential
   area of focus for researchers, farmers, and stakeholders.

   Due to a variety of factors, including unstable estrus conditions and
   protracted calving intervals, traditional breeding methods for
   buffaloes pose significant challenges. These factors complicate the
   breeding process and necessitate innovative approaches to enhance the
   efficiency and success of buffalo reproduction [[44]5]. Modern genetic
   technologies have introduced novel strategies for buffalo breeding,
   potentially simplifying the process. Among the various genetic
   variations, SNPs represent the most fundamental and prevalent form.
   Whole-genome sequencing is a powerful technique that enables the
   identification of genetic variations across the entire genome of an
   organism. This approach not only provides insights into the intrinsic
   genomic information but also plays a pivotal role in the study of human
   diseases, as well as in the breeding of crops and livestock [[45]6].
   Whole-genome association analysis serves as a potent tool for
   investigating complex genetic traits and pinpointing candidate genes.
   By scrutinizing genetic variations and polymorphisms throughout the
   entire genome, this method facilitates the identification of genomic
   regions and genes potentially associated with the traits of interest
   [[46]7]. Globally, in previous years, several studies were conducted
   which analyzed the correlation between SNPs and milk production traits.
   In their study, Islam et al. [[47]8] embarked on a comprehensive
   genome-wide association analysis involving 167 buffalo individuals,
   with the primary objective of discovering novel candidate genes linked
   to traits such as body weight and production performance. This analysis
   aimed to shed light on the genetic underpinnings of these economically
   important characteristics. The study employed the 90K Axiom Buffalo SNP
   Array to detect SNPs within a range of 2000 base pairs upstream of the
   buffalo FABP gene [[48]9]. They identified SNP sites within a specific
   region and performed an association analysis to examine the
   relationship between these SNPs and various milk-related traits.
   Despite the increasing interest in the application of whole-genome
   association analysis in studying milk-related traits in water
   buffaloes, the number of studies conducted in this area remains limited
   when compared to the extensive body of research in cattle. This
   highlights the need for further investigation in this field to deepen
   our understanding of the genetic factors underlying milk production in
   water buffaloes.

   The primary objective of this study was to identify key factors that
   influence lactation performance in water buffaloes, offering a new
   approach to enhancing breeding programs for improved milk production.
   By understanding these factors, we aim to provide valuable insights
   that will guide our future research and contribute to the development
   of more effective breeding strategies.

2. Materials and Methods

2.1. Ethics Statement

   All finished work was conducted in accordance with national and
   international guidelines. The protocol for this study was approved by
   the Attitude of the Animal Care & Welfare Committee of the Guangxi
   Buffalo Research Institute (Approval Code: GXU2019-021).

2.2. Phenotypes and Animal Resources

   The data for this study were collected from 120 water buffaloes,
   comprising 1 local Poyanghu water buffalo (DB), 46 hybrid water
   buffaloes (ZBs), 31 Murrah water buffaloes (MBs), and 42 Nili-Ravi
   water buffaloes (NBs), summing up to a total of 120. Among them, the
   hybrid water buffaloes are the offspring of more water buffaloes and
   local water buffaloes. These water buffaloes were born between the
   years 2000 and 2021. They were fed at the farm of Guangxi Buffalo
   Research Institute during the dry season from April to September. All
   records related to milk production are collected when all the water
   buffaloes are in their second calving. The initial test-day milk
   measurement is conducted from 5 to 70 days post-calving. The target
   traits of this study are 305-day milk yield (MY), peak milk yield (PM),
   total protein yield (PY), total milk fat yield (FY), fat percentage
   (FP), and protein percentage (PP). The calculation methods for PP and
   FP are as follows:
   [MATH: <mrow><mrow><mi>F</mi><mi>P</mi><mo>=</mo><mstyle
   scriptlevel="0"
   displaystyle="true"><mfrac><mrow><mi>F</mi><mi>Y</mi></mrow><mrow><mi>M
   </mi><mi>Y</mi></mrow></mfrac></mstyle></mrow></mrow> :MATH]
   [MATH: <mrow><mrow><mo> </mo><mi>P</mi><mi>P</mi><mo>=</mo><mstyle
   scriptlevel="0"
   displaystyle="true"><mfrac><mrow><mi>P</mi><mi>Y</mi></mrow><mrow><mi>M
   </mi><mi>Y</mi></mrow></mfrac></mstyle></mrow></mrow> :MATH]

2.3. Sample Collection and Sequencing

   Blood samples from water buffalo were obtained through tail vein
   puncture utilizing a vacuum blood collector. The genomic DNA was
   extracted from the blood using the phenol/chloroform method, and its
   integrity and yield were evaluated via agarose gel electrophoresis. The
   DNA libraries were sequenced on the Illumina sequencing platform
   (Illumina HiSeqTM 2000) by Genedenovo Biotechnology Co., Ltd.
   (Guangzhou, China).

2.4. Alignments and Variant Identification

   The clean reads were aligned to the reference genome (UOA_WB_1) using
   BWA-MEM (v0.7.17) with default settings [[49]10]. Then, Samtools
   (v1.9), Picard tools (v3.1.1), and GATK (v4.0) were used for SNP
   detection [[50]11,[51]12]. All detected SNPs underwent filtering
   through the “Variant Filtration” module of GATK, using the following
   standard parameters: variants with Quality Depth (QD) < 2; FS
   (Phred-scaled p-value using Fisher’s exact test for strand bias
   detection) > 60; MQRankSum (Z-score of the rank sum of the Phred-scaled
   mapping qualities) < −12.5; ReadPosRankSum (Z-score of the rank sum of
   the Phred-scaled position bias estimations) < −8; MQ (root mean square
   of the mapping quality) < 40.0; the mean sequencing depth of variants
   (across all individuals) was limited to less than 1/3× and more than
   3×; SOR (strand odds ratio) > 3.0; the maximum missing rate was less
   than 0.1; and SNPs were limited to two alleles.

2.5. Variation Filtering

   The presence of rare alleles (alleles with low frequency within the
   population), high rates of missing data, and substantial heterozygosity
   at specific loci can introduce anomalies in population analysis and
   whole-genome association studies. Therefore, we aligned the processed
   reads to the reference genome (UOA_WB_1). Subsequently, we employed the
   PLINK (v1.9) software to filter the detected loci based on standard
   criteria [[52]13]. The filtering process involved stringent adherence
   to several criteria: exclusion of non-biallelic SNPs, removal of those
   with a minor allele frequency below 0.05, discarding SNPs with a
   missing genotype rate exceeding 20%, and further limiting the analysis
   to SNPs with a heterozygosity ratio below the threshold of 0.8. This
   was all executed using the robust PLINK (v1.9) software.

2.6. Principal Component Analysis

   GCTA (v1.92.2) is a robust tool for the analysis of whole-genome
   complex traits [[53]14]. In this study, we utilized the GCTA (v1.92.2)
   and PLINK (v1.9) software to perform PCA (principal component analysis)
   using the selected SNP markers. This analysis enabled us to derive the
   variance accounted for by each PC (principal component) and the score
   matrix representing the samples’ positions within each PC.

2.7. Population Structure Analysis

   Population structure analysis offers valuable insights into the
   ancestry and composition of individuals, rendering it an exceptionally
   effective approach for elucidating genetic relationships. To validate
   the outcomes of PCA, we performed population structure analysis.
   Model-based population structure inference methods typically assume
   that the markers utilized for analysis are independent of one another.
   Consequently, prior to initiating the analysis, it is essential to
   execute marker independence filtering, which is based on the assessment
   of linkage disequilibrium between markers. In this analysis, we
   utilized the PLINK (v1.9) and Admixture software (v1.3) to perform
   marker filtering for population structure analysis. In our analysis, we
   implemented a 100 kb step size and a 10 nucleotide (nt) window size,
   and we removed one marker from each pair of markers with an r^2 value
   greater than 0.2. Specifically, we removed the marker with the higher
   physical position from each pair of markers with a high degree of
   linkage disequilibrium. As a result of implementing the aforementioned
   filtering strategy, we retained a total of 99,261 markers for the
   population structure analysis. Utilizing the filtered SNP markers, we
   conducted a principal component analysis (PCA) using PLINK to
   investigate the population structure and clustering patterns. The PCA
   results were visualized to illustrate the relationships among the first
   three principal components, providing insights into the genetic
   diversity and relatedness within the population. Additionally, we
   employed the Admixture software (v1.3) to perform an in-depth analysis
   of population structure, estimating the proportion of ancestry from K
   ancestral populations and identifying subpopulations within the
   dataset. In our analysis, we explored the cross-validation (CV) error
   for various k-values using the Admixture software (v1.3), ranging from
   2 to 9. We utilized the PopHelper software (v2.2.7) [[54]15] to
   generate bar plots illustrating the genetic composition of each sample
   within every subgroup. By systematically testing these k-value
   hypotheses, we aimed to identify the optimal number of clusters that
   would provide the most meaningful and informative partitioning of the
   studied populations.

2.8. Genome-Wide Association Mapping

   Our research focused on six primary dairy production traits: MY, PM,
   PY, FY, PP, and FP. By employing the TASSEL software (v5.2.54)
   [[55]16], we executed the widely used General Linear Model (GLM) (Q)
   for genome-wide association studies. After Bonferroni correction, sites
   with p-values less than the given threshold 0.05/N (number of SNP) were
   selected as significant sites. SNPs with p-values below this threshold
   were considered highly significant and selected for further analysis.
   Subsequently, these SNPs were compared against the reference genome to
   pinpoint candidate genes for further investigation. It is anticipated
   that the identification of these genes will significantly contribute to
   a breeding program for buffaloes, leading to enhancements in both milk
   production quantity and the quality of buffalo dairy products.

   Generalized Linear Models are a widely employed and versatile
   statistical method for data analysis [[56]17]. In the present study, we
   utilized Generalized Linear Models for conducting a genome-wide
   association analysis. The GLM (Q) analysis model is expressed as
   follows:
   [MATH:
   <mrow><mrow><mi>y</mi><mo>=</mo><mi>X</mi><mi>α</mi><mo>+</mo><mi>Q</mi
   ><mi>β</mi><mo>+</mo><mi>c</mi><mi>P</mi><mo>+</mo><mi
   mathvariant="normal">e</mi></mrow></mrow> :MATH]

   In this formula, y is the vector of phenotypes, X is the genotype
   matrix,
   [MATH: <mrow><mrow><mi>α</mi></mrow></mrow> :MATH]
   is the vector of genotype effects,
   [MATH: <mrow><mrow><mi>P</mi></mrow></mrow> :MATH]
   is the PCA variance_explained matrix,
   [MATH: <mrow><mrow><mi>c</mi></mrow></mrow> :MATH]
   is the vector of PCA variance effects, e is the vector of residual
   effects, and Q refers to the fixed-effect matrix, which represents calf
   gender, calving year, and herds. The outcomes are presented through
   Manhattan plots and Q-Q plots. SNPs exhibiting p-values below the
   specified threshold 0.05/N (number of SNP) are identified as highly
   significant SNPs. When a reference genome is available, candidate genes
   are determined by including those genes that are physically positioned
   within a 50 kb genomic region surrounding the significant SNPs. DbSNP
   [[57]18] is a database specifically designed by NCBI to store genetic
   variation information. We use dbSNP to determine whether the SNPs we
   have identified are located in the coding regions of genes.

2.9. Pathway Enrichment and Protein–Protein Interaction

   Genes often work in concert to perform specific biological functions.
   Pathway-based analysis is a valuable approach for understanding the
   roles of genes in these complex processes. The KEGG (Kyoto Encyclopedia
   of Genes and Genomes) database [[58]19] stands as one of the foremost
   publicly accessible resources for pathway-related data. To identify
   significantly enriched metabolic and signal transduction pathways among
   CAGs (Candidate-Associated Genes) relative to the entire genome
   context, pathway enrichment analysis was performed. The method for
   calculating enrichment is consistent with that employed in Gene
   Ontology (GO) [[59]20] analysis:
   [MATH:
   <mrow><mrow><mi>P</mi><mo>=</mo><mn>1</mn><mo>−</mo><mrow><munderover><
   mo
   stretchy="true">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>0</mn></mrow><mrow>
   <mi>m</mi><mo>−</mo><mn>1</mn></mrow></munderover></mrow><mstyle
   scriptlevel="0" displaystyle="true"><mfrac><mrow><mo>(</mo><mfrac
   linethickness="0pt"><mrow><mi>M</mi></mrow><mrow><mi>i</mi></mrow></mfr
   ac><mo>)</mo><mo>(</mo><mfrac
   linethickness="0pt"><mrow><mi>N</mi><mo>−</mo><mi>M</mi></mrow><mrow><m
   i>n</mi><mo>−</mo><mi>i</mi></mrow></mfrac><mo>)</mo></mrow><mrow><mo>(
   </mo><mfrac
   linethickness="0pt"><mrow><mi>N</mi></mrow><mrow><mi>n</mi></mrow></mfr
   ac><mo>)</mo></mrow></mfrac></mstyle><mo> </mo></mrow></mrow> :MATH]

   In this context, N signifies the total count of genes with KEGG
   annotations, while n denotes the number of CAGs within N. M represents
   the total number of genes annotated to particular pathways, and m is
   the number of CAGs in M. Following the calculation of the p-value, it
   was corrected using False Discovery Rate (FDR) adjustment, with an FDR
   value of 0.05 or less being set as the threshold. Pathways that meet
   this criterion are categorized as significantly enriched pathways in
   CAGs. Finally, we utilize the String database to identify genes that
   are significantly represented in pathways and create a protein–protein
   interaction map. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and
   Genomes (KEGG) analyses were performed using the OmicShare tools, a
   free online platform for data analysis
   ([60]http://www.omicshare.com/tools, accessed on 9 January 2025).

2.10. Statistical Analysis

   In the analysis of trait correlations, we used Pearson’s correlation
   coefficient to quantify the linear relationships between different
   traits. The Pearson’s correlation coefficient ranges from −1 to 1,
   where values close to 1 indicate a strong positive correlation, values
   close to −1 indicate a strong negative correlation, and values around 0
   indicate no significant correlation. Statistical analyses were
   performed using the SPSS 18.0 software package (SPSS Science, Chicago,
   IL, USA). Experimental data were subjected to t-test and ANOVA
   analyses, with a significance threshold set at p < 0.05. Graphs were
   generated using GraphPad Prism 8 software (GraphPad, Santiago, MN,
   USA). Data are presented as mean ± standard deviation (SD).

3. Results

3.1. Phenotypic Value Statistics of the Traits

   During the phenotypic evaluation of buffalo. We carried out MY, PM, PY,
   PP, FP, and FY phenotypic value analysis.

   For milk production traits, the mean value of MY was recorded as 2321.3
   kg, while the mean value of PM was measured as 11.4 kg. The mean value
   of FY was recorded as 116.1 kg, while the mean value of PY was measured
   as 80.8 kg. The average protein and fat percentages in our population
   were 4.8% and 5.2%, respectively. The phenotypic data statistics are
   presented in [61]Table 1.

Table 1.

   Statistical description of lactation traits *.
   Traits  Mean   SD    Min   Max
     MY   2321.3 861.8 480.8 5185.3
     PM    11.4   4.8   2.7   24.4
     PY    80.8  41.2   4.0  275.8
     FY   116.1  57.9   5.0  363.3
     PP    4.8    0.4   4.4   5.8
     FP    5.2    0.8   3.7   8.9
   [62]Open in a new tab

   * MY, milk yield; PM, peak milk yield; PY, protein yield; FY, fat
   yield; PP, protein percentage; FP, fat percentage; SD, standard
   deviation.

   Through the correlation analysis of various lactation traits, it was
   found that the correlation between total protein yield (PY) and total
   milk fat yield (FY) is the highest, reaching above 0.9 ([63]Figure 1).
   This strong positive correlation suggests a significant relationship
   between the production of protein and fat in milk, indicating that
   these two components tend to vary together in response to genetic or
   environmental factors.

Figure 1.

   [64]Figure 1
   [65]Open in a new tab

   Correlation analysis of various lactation traits. MY, milk yield; PM,
   peak milk yield; PY, protein yield; FY, fat yield; PP, protein
   percentage; FP, fat percentage.

   In this study, a comprehensive analysis of genome-wide variations led
   to the identification of 2,208,174 genetic markers. Among the detected
   genetic markers, 2,012,270 were identified as SNPs and 195,904 were
   classified as insertion–deletion (Indel) variants. Following stringent
   filtering criteria, a refined set of 99,261 markers was retained,
   comprising 93,494 SNPs and 5767 Indels.

3.2. Population Structure

   Upon obtaining PCA scores, the samples under investigation can be
   visualized via a scatter plot that utilizes the values of the first
   three principal components as axes. Referring to [66]Figure 2, in
   scatter plots [67]Figure 2A, it is evident that the majority of
   individuals within herds MB and NB are distinctly isolated from one
   another. In the scatter plot in [68]Figure 2B, we can observe
   clustering, particularly in groups MB and ZB, where these two clusters
   overlap and are closely grouped together in multiple instances. In
   [69]Figure 2A,C, it is evident that the herds are broadly segregated
   into three distinct clusters. One cluster predominantly consists of NB
   herds, whereas the other two clusters are primarily composed of MB, ZB,
   and DB herds, respectively.

Figure 2.

   [70]Figure 2
   [71]Open in a new tab

   The sample clustering obtained from PCA through three two-dimensional
   scatter plots, namely scatter (A), scatter (B), and scatter (C); scree
   plot (D). The percentage of variance explained by each PC is noted in
   parentheses. In the scatter plots, colored circles represent four
   different groups: DB, MB, NB, and ZB correspond to 1 DB, 42 MBs, 31
   NBs, and 46 ZBs, respectively.

   To establish the optimal number of clusters (k), we used the Admixture
   software and evaluated cross-validation error rates. The Admixture
   algorithm performs model-based clustering and estimates the proportion
   of ancestry from K ancestral populations. By minimizing the
   cross-validation error rates, we identified the value of k = 3 that
   best fits our data. [72]Figure 3 displays the line graph depicting the
   cross-validation error rate.

Figure 3.

   [73]Figure 3
   [74]Open in a new tab

   The line graph illustrating the cross-validation error rate is
   depicted, with the number of sample clusters delineated along the
   x-axis and the corresponding cross-validation error rate indicated on
   the y-axis.

   To simulate the population classification and genetic ancestry of each
   sample across varying numbers of subgroups (K = 2–9), we utilized the
   PopHelper software (v2.2.7) [[75]15] to generate bar plots illustrating
   the genetic composition of each sample within every subgroup. The
   results are presented in [76]Figure 4, where each color corresponds to
   a distinct cluster for each K-value. From the line graph of the
   cross-validation error rate, it is evident that the optimal number of
   clusters is K = 3. Similarly, as observed in the bar graph ([77]Figure
   3) depicting the genetic composition of the samples, when K = 3, it is
   the optimal number of clusters for these 120 buffaloes. This finding is
   consistent with the results obtained from the PCA. Consequently, we
   conclude that these 120 buffaloes can be effectively divided into three
   distinct subgroups.

Figure 4.

   [78]Figure 4
   [79]Open in a new tab

   Genetic bar chart illustration for K-means clustering with varying
   numbers of clusters (K = 2 to 9).

3.3. Results of the Genome-Wide Associations

   Following the calculation of p-values for the SNP loci using a
   Generalized Linear Model, we constructed a Manhattan plot and a Q-Q
   plot, as presented in [80]Figure 5.

Figure 5.

   [81]Figure 5
   [82]Open in a new tab

   Association analysis with milk production-related traits in water
   buffalo was conducted using the GLM-Q approach. The traits investigated
   include MY (A), PM (B), PY (C), FY (D), PP (E), and FP (F). The
   Manhattan plot on the left, created using the qqman package,
   illustrates the p-values for SNP markers across 25 chromosomes
   (comprising 24 autosomes and 1 X chromosome). The blue line delineating
   the Manhattan plot signifies the significance threshold, determined by
   0.05/N (number of SNP). Markers that surpass this threshold are deemed
   significant. The plot on the right is a Q-Q plot, where the x-axis
   denotes the observed values of the markers, and the y-axis represents
   the expected values, which have been transformed into the −10 log
   scale.

   The leftmost plot is a Manhattan plot, where the visually discernible
   blue line, running parallel to the x-axis, serves as the critical
   demarcation line. In the Manhattan plot, the points that rise above the
   threshold line, which is the blue line paralleling the x-axis, signify
   significant loci. Upon identifying the significant loci that surpass
   the threshold line in the Manhattan plot, the subsequent step involves
   documenting the relevant information pertaining to these significant
   loci.

   The plot on the right is the Q-Q plot, where the points in the bottom
   left corner fall along the line, suggesting that the observed p-values
   align closely with the expected values. The points exhibit a distinct
   upward deviation from the diagonal in the upper right corner, which
   signifies that the observed p-values exceed the anticipated values. The
   presence of these points, which denote significant loci, across all
   four plots underscores the validity of the analytical model, suggesting
   its appropriateness in capturing the underlying patterns. Following
   this, the genes situated within a 50 kb range of the significant loci
   are carefully selected and earmarked as candidate genes.

   In our GWAS, we have identified a large number of SNPs associated with
   FY and PY ([83]Figure 5). We identified 56 statistically significant
   SNPs and 54 candidate genes within a 50 Kb range surrounding these loci
   that were associated with the traits MY, PY, PP, FP, and FY ([84]Table
   2).

Table 2.

   The SNPs identified via genome-wide association analysis encompass
   detailed information regarding their chromosomal locations, p-values,
   and associated candidate genes.
   Traits SNP Chr Pos p R^2 Candidate Genes
   MY 1 [85]NC_037552.1 110795896 4.3 × 10^−7 0.39 CNTNAP2
   PY 2 [86]NC_037545.1 45287655 4.01 × 10^−7 0.36 --
   PY 3 [87]NC_037545.1 45287667 4.01 × 10^−7 0.36 --
   PY 4 [88]NC_037545.1 45287677 4.01 × 10^−7 0.36 --
   PY 5 [89]NC_037545.1 45287689 4.01 × 10^−7 0.36 --
   PY 6 [90]NC_037545.1 155842848 6.41 × 10^−7 0.43 KCNAB1
   PY 7 [91]NC_037546.1 174939089 1.82 × 10^−7 0.37 TINAGL1; AZIN2
   PY 8 [92]NC_037547.1 10261615 8.43 × 10^−8 0.35 RBFOX3
   PY 9 [93]NC_037548.1 58828968 3.56 × 10^−7 0.33 NEDD1
   PY 10 [94]NC_037548.1 97643871 9.64 × 10^−8 0.369 EEA1; PLEKHG7
   PY 11 [95]NC_037549.1 52254091 4.08 × 10^−7 0.31 LEFTY2; PYCR2; -
   PY 12 [96]NC_037549.1 122104312 8.972 × 10^−8 0.41 DOC2G; NUDT8; TBX10;
   ALDH3B1; UNC93B1; ALDH3B1; NDUFS8; TCIRG1
   PY 13 [97]NC_037550.1 7990287 2.51 × 10^−7 0.34 CFAP126; SDHC
   PY 14 [98]NC_037550.1 109178670 1.5 × 10^−9 0.48 MAP7D1; TRAPPC3;
   COL8A; ADPRHL2; TEKT2
   PY 15 [99]NC_037551.1 62424439 7.20 × 10^−11 0.54 --
   PY 16 [100]NC_037551.1 87385802 1.7 × 10^−9 0.43 CAMK2D
   PY 17 [101]NC_037552.1 119759448 4.64 × 10^−7 0.30 USP17L13
   PY 18 [102]NC_037556.1 9792358 3.18 × 10^−7 0.36 --
   PY 19 [103]NC_037557.1 18868198 1.71 × 10^−7 0.34 ABCC4
   PY 20 [104]NC_037557.1 18868200 1.72 × 10^−7 0.34 ABCC4
   PY 21 [105]NC_037557.1 18868226 1.71 × 10^−7 0.34 ABCC4
   PY 22 [106]NC_037557.1 18868442 2.02 × 10^−7 0.34 ABCC4
   PY 23 [107]NC_037557.1 18868443 2.02 × 10^−7 0.34 ABCC4
   PY 24 [108]NC_037557.1 18868449 2.02 × 10^−7 0.34 ABCC4
   PY 25 [109]NC_037557.1 18868500 1.71 × 10^−7 0.34 ABCC4
   PY 26 [110]NC_037557.1 18868507 1.71 × 10^−7 0.34 ABCC4
   PY 27 [111]NC_037557.1 18868523 1.71 × 10^−7 0.34 ABCC4
   PY 28 [112]NC_037560.1 28753781 9.90 × 10^−8 0.40 GUCY2D; LRRC32
   PY 29 [113]NC_037560.1 35600620 3.36 × 10^−8 0.40 OR52Z1; OR51V1;
   OR51V1; OR52A5; OR52K1; OR52K1
   PY 30 [114]NC_037560.1 60745000 4.5 × 10^−10 0.44 TMPRSS5
   PY 31 [115]NC_037560.1 74611905 1.48 × 10^−8 0.38 CNTN5
   PY 32 [116]NC_037562.1 12469915 1.04 × 10^−8 0.43 --
   PY 33 [117]NC_037564.1 56654985 8.12 × 10^−8 0.39 --
   PY 34 [118]NC_037564.1 56655021 8.12 × 10^−8 0.39 --
   PY 35 [119]NC_037565.1 13765148 1.19 × 10^−7 0.37 CTNNB1
   PY 36 [120]NC_037566.1 61557795 1.04 × 10^−7 0.37 KCNG2; PQLC1; TXNL4A;
   YVCT
   PY 37 [121]NC_037567.1 17641416 1.31 × 10^−8 0.39 TLL2; TM9SF3
   PY 38 [122]NC_037567.1 17641470 1.91 × 10^−8 0.39 TLL2; TM9SF3
   PY 39 [123]NC_037567.1 17641671 1.31 × 10^−8 0.39 TLL2; TM9SF3
   PY 40 [124]NC_037567.1 44281645 8.32 × 10^−8 0.34 --
   FY 41 [125]NC_037550.1 109178670 8.94 × 10^−8 0.37 MAP7D1; TRAPPC3;
   COL8A; ADPRHL2; TEKT2
   FY 42 [126]NC_037551.1 62424439 1.7 × 10^−8 0.42 --
   FY 43 [127]NC_037557.1 18651732 2.48 × 10^−7 0.32 ABCC4
   FY 44 [128]NC_037560.1 60745000 4.44 × 10^−7 0.32 TMPRSS5
   FY 45 [129]NC_037560.1 74050792 3.67 × 10^−7 0.36 --
   PP 46 [130]NC_037556.1 50669172 3.34 × 10^−8 0.43 --
   FP 47 [131]NC_037545.1 35877328 4.11 × 10^−7 0.30 --
   FP 48 [132]NC_037546.1 6705158 3.44 × 10^−7 0.27 --
   FP 49 [133]NC_037548.1 4405934 6.25 × 10^−8 0.33 FAM118A; UPK3A;
   KIAA0930
   FP 50 [134]NC_037552.1 24460 6.65 × 10^−8 0.29 --
   FP 51 [135]NC_037552.1 112491377 3.27 × 10^−8 0.32 ZNF777; ZNF746
   FP 52 [136]NC_037552.1 113607118 2.36 × 10^−8 0.33 ABCB8; ASIC3
   FP 53 [137]NC_037555.1 29774524 2.45 × 10^−7 0.28 PRKCH
   FP 54 [138]NC_037559.1 81684074 5.71 × 10^−8 0.32 DGAT1; HSF1
   FP 55 [139]NC_037564.1 36725181 9.35 × 10^−8 0.31 LINGO1
   FP 56 [140]NC_037569.1 5832499 2.09 × 10^−8 0.35 KAL1
   [141]Open in a new tab

3.4. Kyoto Encyclopedia of Genes and Genomes Pathway Analysis of Candidate
Genes

   GO and KEGG analyses were conducted to examine the functional pathways
   for DEGs. The top 20 GO terms, classified by –log10(p-value), were
   significantly enriched in candidate genes compared to the genome
   background GO. The top 20 GO terms included eight biological processes
   (BPs), a cellular component (CC) and one molecular function (MF). The
   BP terms were enriched in growth (GO:004007), developmental processes
   (GO:0032502), cellular processes (GO:0009987), metabolic processes
   (GO:0008152), and cellular component organization or biogenesis
   (GO:0071840) ([142]Figure 6A, [143]Supplementary Table S1).

Figure 6.

   [144]Figure 6
   [145]Open in a new tab

   GO and KEGG analysis of candidate genes. (A) GO bar plot diagram
   showing the top 20 enriched GO terms. GO categories, including cellular
   component, biological process, and molecular function. (B) The
   enrichment circle diagram shows the KEGG analysis of the top 20
   pathways. Four circles from the outside to the inside. First circle:
   the classification of enrichment; outside the circle is the scale of
   the number of genes. Different colors represent different categories.
   Second circle: number and p-values of the classification in the
   background genes. The more genes, the longer the bars; the smaller the
   value, the redder the color. Third circle: bar chart of the total
   number of candidate genes. Fourth circle: rich factor value of each
   classification (number of candidate genes in this classification
   divided by the number of background genes). Each cell of the background
   helper line represents 0.1, and the color coding signifies the
   statistical significance of the corresponding enrichment.

   The functional enrichment cycle diagram displays the top 20 KEGG
   pathways ([146]Figure 6B, [147]Supplementary Tables S2 and S3),
   classified by –log10 (p-value), which reveals that the candidate genes
   were mainly enriched in five KEGG_A_class pathways, including
   metabolism, environmental information processing, cellular processes,
   organismal systems, and human diseases. Among these pathways,
   environmental information processing included ABC transporters
   (ko02010), the cAMP signaling pathway (ko04024), the TGF-β signaling
   pathway (ko04350), and the wnt signaling pathway (ko04310); metabolism
   included arginine and proline metabolism (ko00330) and phenylalanine
   metabolism (ko00360); biosynthesis of amino acids (ko01230) and
   metabolic pathways was involved (ko01100); and cellular processes
   included signaling pathways regulating pluripotency of stem cells
   (ko04550). In conclusion, our data suggest that the identified
   candidate genes play a crucial role in regulating lactation
   performance, particularly in terms of milk fat yield (FY), protein
   percentage(PP), fat percentage (FP), and protein yield (PY), by
   modulating the tgf-β signaling pathway, wnt signaling pathway,
   metabolic pathways, and cAMP signaling pathway. These findings provide
   valuable insights into the molecular mechanisms underlying dairy
   production and could inform future breeding strategies to enhance milk
   quality and quantity.

3.5. Significant Association of Milk Protein Content with SNP Validation

   [148]Table 3 presents the results of individual genotyping for four key
   loci in water buffalo: [149]NC_037557.1:18868198 (ABCC4),
   [150]NC_037549.1:52254091 (LEFTY2), [151]NC_037559.1:81684074 (DGAT1),
   and [152]NC_037551.1:87385802 (CAMK2D). The analysis reveals
   significant differences in milk protein content across different
   genotypes at these loci.

Table 3.

   The results of individual genotyping *.
   Candidate Genes SNP (Chr:Pos) Milk Protein Yield
   Homozygous Mutation Heterozygous Mutation Reference Genotype
   ABCC4 [153]NC_037557.1:18868198 A/A G/A G/G
   73.46 ± 29.68C 103.97 ± 47.99B 144.57 ± 65.04A
   LEFTY2 [154]NC_037549.1:52254091 A/A G/A G/G
   73.87 ± 31.82B 97.07 ± 19.41B 151.03 ± 74.08A
   DGAT1 [155]NC_037559.1:81684074 T/T C/T C/C
   67.01 ± 38.53B 90.62 ± 45.36B 125.61 ± 71.25A
   CAMK2D [156]NC_037551.1:87385802 C/C C/T T/T
   62.03 ± 48.35C 95.85 ± 58.65B 133.24 ± 79.52A
   [157]Open in a new tab

   * The phenotypic values of milk protein content are expressed as “least
   squares mean ± standard deviation”. Different letters in the same
   column indicate significant differences (p < 0.05); the same letter or
   no letter indicates no significant difference (p > 0.05). The genotypes
   of SNP loci are arranged in the order of homozygous mutant,
   heterozygous, and reference types.

   For ABCC4 ([158]NC_037557.1:18868198), the A/A genotype is associated
   with significantly lower milk protein yield compared to the G/A and G/G
   genotypes.

   For LEFTY2 ([159]NC_037549.1:52254091), both the A/A and G/A genotypes
   exhibit significantly lower milk protein yield compared to the G/G
   genotype.

   For DGAT1 ([160]NC_037559.1:81684074), the T/T and C/T genotypes are
   associated with significantly lower milk protein yield compared to the
   C/C genotype.

   For CAMK2D ([161]NC_037551.1:87385802), the T/T genotype shows
   significantly lower milk protein yield compared to both the CT and C/C
   genotypes.

   These findings highlight the influence of specific genotypes on milk
   protein yield, providing valuable insights for genetic selection and
   breeding programs aimed at improving milk quality in water buffalo.

4. Discussion

4.1. Population Stratification

   Population stratification is a pivotal factor in GWASs, as it
   significantly influences the findings due to disparities in ancestral
   origins. These differences can give rise to discrepancies in allele
   frequencies across populations, potentially generating spurious
   association signals [[162]21,[163]22]. To mitigate the risk of
   false-positive findings in our analysis, it is imperative to
   acknowledge the existence of population stratification. PCA is capable
   of diminishing the complexity of a dataset while keeping its covariance
   structure intact [[164]23]. To assess the classification effectiveness
   of the first 10 principal components, we calculated the proportion of
   variance explained by each component using the filtered SNP markers
   with PLINK. This analysis helped us understand how well the principal
   components capture the genetic variation within the population.
   Furthermore, we used the Admixture software to perform K-cluster
   analysis and evaluated the cross-validation error (CV-error) to
   determine the optimal number of subpopulations. By minimizing the
   CV-error, we identified the most appropriate value of K for population
   stratification. Though these buffaloes are raised on the same farm,
   their origins differ: the Nili-Ravi buffalo hails from Pakistan, while
   the Murrah buffalo originates from India. From our PCA plot, we observe
   clustering, particularly between groups MB and ZB, where these two
   clusters overlap and coalesce in several instances. This indicates that
   the buffaloes in these two groups may have a closer genetic
   relationship compared to the other groups. Overall, these groups can be
   generally categorized into two distinct segments. The plot depicting
   cross-validation error suggests that the optimal value of K, which
   corresponds to the lowest cross-validation error, is 3. Therefore, it
   is concluded that these 120 water buffaloes should be divided into
   three subgroups.

   The simple linear model serves as a valuable instrument for conducting
   SNP and phenotype analysis, concurrently managing population
   stratification by incorporating the relevant population structure as a
   covariate [[165]24]. Typically, the models used in GWAS analysis are
   adjusted based on the genetic background and stratification of the
   dataset. In general, after correcting for population stratification,
   the inflation factor should approach a value of 1 in instances that
   adhere to a normal distribution [[166]24]. In our experimental
   outcomes, the inflation factor for MY was 1.056, that for PM was 1.1,
   that for PY was 0909, that for PP was 1.06, that for FP was 1.12, and
   that for total FY was 0.916. These findings suggest that our model’s
   adjustment for population stratification is valid.

4.2. Genome-Wide Association Analysis of Milk Production-Related Traits

   Enhancing milk production and quality in water buffalo has emerged as a
   critical research priority within the dairy industry. Water buffalo
   play a vital role in global dairy production, particularly in regions
   where they are the primary milk source. However, there is a significant
   research gap in understanding the lactation traits of water buffalo.
   Unlike Holstein cows, which have been extensively studied through
   detailed production records and numerous genome-wide association
   studies (GWASs), water buffalo have received considerably less
   attention in these areas. This disparity in research means that key
   aspects of water buffalo genetics, such as the identification of
   candidate genes and pathways influencing milk yield, protein content,
   and fat composition, remain underexplored. As a result, breeding
   programs for water buffalo are not as advanced or effective as those
   for other dairy species, limiting the potential for improving milk
   production and quality. Addressing this research gap is essential for
   developing targeted breeding strategies and enhancing the overall
   productivity of water buffalo dairy operations.

   For milk production traits, this investigation assessed and quantified
   the MY as well as the PM. The mean values were 2321.3 kg and 11.4 kg,
   respectively. This investigation also assessed and quantified the PY,
   PP, FP, and FY. The mean values were 80.8 kg, 4.8%, 5.2%, and 116.1 kg,
   respectively. The mean milk yield (MY) of our buffalo population is
   lower than that of Mediterranean buffaloes (2321.3 kg) but higher than
   that of Brazilian buffaloes (1578.90 kg). In contrast, the mean values
   for peak milk yield (PM), fat yield (FY), and protein yield (PY) are
   more closely aligned with those of Mediterranean buffaloes
   [[167]25,[168]26]. Therefore, we are confident that our production
   record measurements are within the normal range and can be converted
   into corresponding phenotype vectors in the model. We carried out an
   association analysis for each of the phenotypic traits.

   In the association analysis of PY and FY, we identified a total of 54
   candidate genes. Lactation is a highly complex process involving the
   coordinated action of multiple signaling pathways. Among the identified
   genes, several are associated with key pathways, including the cAMP
   signaling pathway (ABCC4), TGF-β signaling pathway (LEFTY2), Wnt
   signaling pathway (CAMK2D), and metabolic pathways (DGAT1).

   The ABCC4 gene, located on chromosome [169]NC_037557.1, encodes
   ATP-binding cassette (ABC) transporters. These proteins are essential
   for modulating platelet aggregation, a critical process in blood
   clotting and hemostasis [[170]27]. Moreover, ABCC4 has been detected in
   milk from the early stages of lactation [[171]28]. Our findings
   indicate that ABCC4 may regulate milk production by influencing the
   cAMP signaling pathway, suggesting its potential importance in
   lactation biology.

   The TGF-β signaling pathway (LEFTY2) plays a pivotal role in mammary
   gland development and lactation [[172]29,[173]30,[174]31]. Studies have
   shown that the TGF-β pathway not only regulates the proliferation and
   differentiation of mammary epithelial cells but also influences the
   synthesis and secretion of milk components. Specifically, TGF-β
   promotes the self-renewal of mammary stem cells while inhibiting
   excessive cell proliferation, ensuring the proper development and
   function of mammary tissue.

   The Wnt signaling pathway is a complex regulatory network that plays a
   crucial role in controlling cell growth, differentiation, and tissue
   development. In particular, breast development is intricately linked to
   lactation performance, as the proper formation and function of the
   mammary gland are essential for efficient milk production. Extensive
   previous research has shown that the gene expression and epigenetic
   regulation of CAMK2D (Calcium/Calmodulin-Dependent Protein Kinase II
   Delta) significantly impact the physiological functions of the mammary
   gland. CAMK2D is involved in various cellular processes, including
   calcium signaling and gene transcription, which are critical for
   mammary gland development and lactation [[175]32,[176]33].

   Milk fat provides essential fatty acids (FAs) that can contribute to
   various health benefits, such as supporting cardiovascular health,
   enhancing nutrient absorption, and promoting overall well-being,
   depending on their specific composition. Previous research has
   demonstrated that polymorphisms in the DGAT1 (Diacylglycerol
   O-Acyltransferase 1) gene influence various milk production traits,
   including milk yield, fat yield, protein yield, and the fat and protein
   content of milk [[177]34,[178]35,[179]36]. Consistent with these
   findings, our study revealed that DGAT1 polymorphisms have a
   significant impact on fat content, further highlighting the importance
   of this gene in determining milk composition.

4.3. The Mechanism of SNP Mutation and the Milk Production Traits

   Among the selected SNPs, none are non-synonymous. However, we
   identified base mutations in non-coding regions, specifically at
   positions [180]NC_037557.1:18868198, [181]NC_037549.1:52254091,
   [182]NC_037559.1:81684074, and [183]NC_037551.1:87385802, which may
   potentially influence milk production-related traits. These results
   suggest that variation in non-coding regions may play an important role
   in the regulation of these important production and quality traits. The
   variability of non-transcriptional regulatory sequences (e.g.,
   promoters, enhancers, CTCF sites) is closely related to the mechanism
   of non-coding variation in cell development, but whether this rule
   applies to milk production traits needs further verification
   [[184]37,[185]38].

5. Conclusions

   In this investigation, genome-wide association analysis was performed
   on four milk production-related characteristics in domestic water
   buffaloes, which included MY, PM, PY, and FY. A total of 56 significant
   SNP loci were identified, and 54 candidate genes within a 50 Kb range
   surrounding these loci were selected. These candidate genes were
   enriched in biological processes such as the cAMP signaling pathway
   (ABCC4), the TGF-β signaling pathway (LEFTY2), the Wnt signaling
   pathway (CAMK2D), and metabolic pathways (DGAT1), all of which are
   directly or indirectly involved in the lactation process. These
   findings offer a reference framework for comprehending the genetic
   architecture underlying milk production and quality attributes in water
   buffaloes, thereby paving the way for subsequent biological validation
   of the implicated genes. This is crucial guidance for breeding and
   improvement programs in water buffaloes. In future studies, we will
   focus on elucidating the relationships between candidate genes and
   metabolites involved in the lactation process to uncover the metabolic
   pathways and mechanisms through which these genes influence milk
   production and composition.

Acknowledgments