Abstract
Background: Buffaloes are considered an indispensable genetic resource
for dairy production. However, improvements in lactation performance
have been relatively limited. Advances in sequencing technology,
combined with genome-wide association studies, have facilitated the
breeding of high-quality buffalo. Methods: We conducted an integrated
analysis of genomic sequencing data from 120 water buffalo, the
high-quality water buffalo genome assembly designated as UOA_WB_1, and
milk production traits, including 305-day milk yield (MY), peak milk
yield (PM), total protein yield (PY), protein percentage (PP), fat
percentage (FP), and total milk fat yield (FY). Results: The results
identified 56 significant SNPs, and based on these markers, 54
candidate genes were selected. These candidate genes were significantly
enriched in lactation-related pathways, such as the cAMP signaling
pathway (ABCC4), TGF-β signaling pathway (LEFTY2), Wnt signaling
pathway (CAMK2D), and metabolic pathways (DGAT1). Conclusions: These
candidate genes (e.g., ABCC4, LEFTY2, CAMK2D, DGAT1) provide a
substantial theoretical foundation for molecular breeding to enhance
milk production in buffaloes.
Keywords: buffalo, milk production traits, genome-wide association
study
1. Introduction
Water buffaloes hold immense significance in the dual domains of meat
and milk production, with their milk making a considerable contribution
to the global dairy sector. Specifically, buffalo milk accounts for an
impressive share of over 15% of the global milk yield, underscoring the
essential role of this species in the global dairy industry. This
statistic highlights the importance of focusing on the welfare,
productivity, and reproductive efficiency of buffaloes to ensure the
continued growth and sustainability of this critical livestock sector
[[40]1]. Buffalo milk is distinguished by its elevated levels of fat,
protein, and minerals when compared to cattle milk, which contributes
to its nutritional superiority and economic value. This nutritional
profile is a key factor in the greater prevalence of buffaloes in Asia,
where they outnumber many other livestock species. The unique
composition of buffalo milk not only enhances its suitability for dairy
products but also reinforces the strategic importance of buffaloes in
the agricultural economies of the region [[41]2]. Furthermore, buffalo
milk stands out in the market with economic significance, often
fetching a price that is double that of bovine milk, reflecting its
exceptional value. This distinct pricing advantage underscores the
commercial viability of buffalo milk and its pivotal role in supporting
the livelihoods and regional economies, distinguishing it from
conventional milk sources. The premium price tag underscores the
premium nutritional content and the potential for higher-quality dairy
products derived from buffalo milk [[42]3]. However, when it comes to
average milk production, even the most productive buffalo breeds yield
significantly less than Holstein cows, as evidenced by tests conducted
on milk yields. This discrepancy highlights the substantial difference
in milk-producing capabilities between the two species [[43]4]. Indeed,
increasing buffalo milk production while simultaneously improving its
quality is a vital goal for fully realizing the economic potential of
the buffalo milk industry. This objective, if achieved, would bring
about significant benefits for the industry, making it an essential
area of focus for researchers, farmers, and stakeholders.
Due to a variety of factors, including unstable estrus conditions and
protracted calving intervals, traditional breeding methods for
buffaloes pose significant challenges. These factors complicate the
breeding process and necessitate innovative approaches to enhance the
efficiency and success of buffalo reproduction [[44]5]. Modern genetic
technologies have introduced novel strategies for buffalo breeding,
potentially simplifying the process. Among the various genetic
variations, SNPs represent the most fundamental and prevalent form.
Whole-genome sequencing is a powerful technique that enables the
identification of genetic variations across the entire genome of an
organism. This approach not only provides insights into the intrinsic
genomic information but also plays a pivotal role in the study of human
diseases, as well as in the breeding of crops and livestock [[45]6].
Whole-genome association analysis serves as a potent tool for
investigating complex genetic traits and pinpointing candidate genes.
By scrutinizing genetic variations and polymorphisms throughout the
entire genome, this method facilitates the identification of genomic
regions and genes potentially associated with the traits of interest
[[46]7]. Globally, in previous years, several studies were conducted
which analyzed the correlation between SNPs and milk production traits.
In their study, Islam et al. [[47]8] embarked on a comprehensive
genome-wide association analysis involving 167 buffalo individuals,
with the primary objective of discovering novel candidate genes linked
to traits such as body weight and production performance. This analysis
aimed to shed light on the genetic underpinnings of these economically
important characteristics. The study employed the 90K Axiom Buffalo SNP
Array to detect SNPs within a range of 2000 base pairs upstream of the
buffalo FABP gene [[48]9]. They identified SNP sites within a specific
region and performed an association analysis to examine the
relationship between these SNPs and various milk-related traits.
Despite the increasing interest in the application of whole-genome
association analysis in studying milk-related traits in water
buffaloes, the number of studies conducted in this area remains limited
when compared to the extensive body of research in cattle. This
highlights the need for further investigation in this field to deepen
our understanding of the genetic factors underlying milk production in
water buffaloes.
The primary objective of this study was to identify key factors that
influence lactation performance in water buffaloes, offering a new
approach to enhancing breeding programs for improved milk production.
By understanding these factors, we aim to provide valuable insights
that will guide our future research and contribute to the development
of more effective breeding strategies.
2. Materials and Methods
2.1. Ethics Statement
All finished work was conducted in accordance with national and
international guidelines. The protocol for this study was approved by
the Attitude of the Animal Care & Welfare Committee of the Guangxi
Buffalo Research Institute (Approval Code: GXU2019-021).
2.2. Phenotypes and Animal Resources
The data for this study were collected from 120 water buffaloes,
comprising 1 local Poyanghu water buffalo (DB), 46 hybrid water
buffaloes (ZBs), 31 Murrah water buffaloes (MBs), and 42 Nili-Ravi
water buffaloes (NBs), summing up to a total of 120. Among them, the
hybrid water buffaloes are the offspring of more water buffaloes and
local water buffaloes. These water buffaloes were born between the
years 2000 and 2021. They were fed at the farm of Guangxi Buffalo
Research Institute during the dry season from April to September. All
records related to milk production are collected when all the water
buffaloes are in their second calving. The initial test-day milk
measurement is conducted from 5 to 70 days post-calving. The target
traits of this study are 305-day milk yield (MY), peak milk yield (PM),
total protein yield (PY), total milk fat yield (FY), fat percentage
(FP), and protein percentage (PP). The calculation methods for PP and
FP are as follows:
[MATH: FP=FYM
Y :MATH]
[MATH: PP=PYM
Y :MATH]
2.3. Sample Collection and Sequencing
Blood samples from water buffalo were obtained through tail vein
puncture utilizing a vacuum blood collector. The genomic DNA was
extracted from the blood using the phenol/chloroform method, and its
integrity and yield were evaluated via agarose gel electrophoresis. The
DNA libraries were sequenced on the Illumina sequencing platform
(Illumina HiSeqTM 2000) by Genedenovo Biotechnology Co., Ltd.
(Guangzhou, China).
2.4. Alignments and Variant Identification
The clean reads were aligned to the reference genome (UOA_WB_1) using
BWA-MEM (v0.7.17) with default settings [[49]10]. Then, Samtools
(v1.9), Picard tools (v3.1.1), and GATK (v4.0) were used for SNP
detection [[50]11,[51]12]. All detected SNPs underwent filtering
through the “Variant Filtration” module of GATK, using the following
standard parameters: variants with Quality Depth (QD) < 2; FS
(Phred-scaled p-value using Fisher’s exact test for strand bias
detection) > 60; MQRankSum (Z-score of the rank sum of the Phred-scaled
mapping qualities) < −12.5; ReadPosRankSum (Z-score of the rank sum of
the Phred-scaled position bias estimations) < −8; MQ (root mean square
of the mapping quality) < 40.0; the mean sequencing depth of variants
(across all individuals) was limited to less than 1/3× and more than
3×; SOR (strand odds ratio) > 3.0; the maximum missing rate was less
than 0.1; and SNPs were limited to two alleles.
2.5. Variation Filtering
The presence of rare alleles (alleles with low frequency within the
population), high rates of missing data, and substantial heterozygosity
at specific loci can introduce anomalies in population analysis and
whole-genome association studies. Therefore, we aligned the processed
reads to the reference genome (UOA_WB_1). Subsequently, we employed the
PLINK (v1.9) software to filter the detected loci based on standard
criteria [[52]13]. The filtering process involved stringent adherence
to several criteria: exclusion of non-biallelic SNPs, removal of those
with a minor allele frequency below 0.05, discarding SNPs with a
missing genotype rate exceeding 20%, and further limiting the analysis
to SNPs with a heterozygosity ratio below the threshold of 0.8. This
was all executed using the robust PLINK (v1.9) software.
2.6. Principal Component Analysis
GCTA (v1.92.2) is a robust tool for the analysis of whole-genome
complex traits [[53]14]. In this study, we utilized the GCTA (v1.92.2)
and PLINK (v1.9) software to perform PCA (principal component analysis)
using the selected SNP markers. This analysis enabled us to derive the
variance accounted for by each PC (principal component) and the score
matrix representing the samples’ positions within each PC.
2.7. Population Structure Analysis
Population structure analysis offers valuable insights into the
ancestry and composition of individuals, rendering it an exceptionally
effective approach for elucidating genetic relationships. To validate
the outcomes of PCA, we performed population structure analysis.
Model-based population structure inference methods typically assume
that the markers utilized for analysis are independent of one another.
Consequently, prior to initiating the analysis, it is essential to
execute marker independence filtering, which is based on the assessment
of linkage disequilibrium between markers. In this analysis, we
utilized the PLINK (v1.9) and Admixture software (v1.3) to perform
marker filtering for population structure analysis. In our analysis, we
implemented a 100 kb step size and a 10 nucleotide (nt) window size,
and we removed one marker from each pair of markers with an r^2 value
greater than 0.2. Specifically, we removed the marker with the higher
physical position from each pair of markers with a high degree of
linkage disequilibrium. As a result of implementing the aforementioned
filtering strategy, we retained a total of 99,261 markers for the
population structure analysis. Utilizing the filtered SNP markers, we
conducted a principal component analysis (PCA) using PLINK to
investigate the population structure and clustering patterns. The PCA
results were visualized to illustrate the relationships among the first
three principal components, providing insights into the genetic
diversity and relatedness within the population. Additionally, we
employed the Admixture software (v1.3) to perform an in-depth analysis
of population structure, estimating the proportion of ancestry from K
ancestral populations and identifying subpopulations within the
dataset. In our analysis, we explored the cross-validation (CV) error
for various k-values using the Admixture software (v1.3), ranging from
2 to 9. We utilized the PopHelper software (v2.2.7) [[54]15] to
generate bar plots illustrating the genetic composition of each sample
within every subgroup. By systematically testing these k-value
hypotheses, we aimed to identify the optimal number of clusters that
would provide the most meaningful and informative partitioning of the
studied populations.
2.8. Genome-Wide Association Mapping
Our research focused on six primary dairy production traits: MY, PM,
PY, FY, PP, and FP. By employing the TASSEL software (v5.2.54)
[[55]16], we executed the widely used General Linear Model (GLM) (Q)
for genome-wide association studies. After Bonferroni correction, sites
with p-values less than the given threshold 0.05/N (number of SNP) were
selected as significant sites. SNPs with p-values below this threshold
were considered highly significant and selected for further analysis.
Subsequently, these SNPs were compared against the reference genome to
pinpoint candidate genes for further investigation. It is anticipated
that the identification of these genes will significantly contribute to
a breeding program for buffaloes, leading to enhancements in both milk
production quantity and the quality of buffalo dairy products.
Generalized Linear Models are a widely employed and versatile
statistical method for data analysis [[56]17]. In the present study, we
utilized Generalized Linear Models for conducting a genome-wide
association analysis. The GLM (Q) analysis model is expressed as
follows:
[MATH:
y=Xα+Qβ+cP+e :MATH]
In this formula, y is the vector of phenotypes, X is the genotype
matrix,
[MATH: α :MATH]
is the vector of genotype effects,
[MATH: P :MATH]
is the PCA variance_explained matrix,
[MATH: c :MATH]
is the vector of PCA variance effects, e is the vector of residual
effects, and Q refers to the fixed-effect matrix, which represents calf
gender, calving year, and herds. The outcomes are presented through
Manhattan plots and Q-Q plots. SNPs exhibiting p-values below the
specified threshold 0.05/N (number of SNP) are identified as highly
significant SNPs. When a reference genome is available, candidate genes
are determined by including those genes that are physically positioned
within a 50 kb genomic region surrounding the significant SNPs. DbSNP
[[57]18] is a database specifically designed by NCBI to store genetic
variation information. We use dbSNP to determine whether the SNPs we
have identified are located in the coding regions of genes.
2.9. Pathway Enrichment and Protein–Protein Interaction
Genes often work in concert to perform specific biological functions.
Pathway-based analysis is a valuable approach for understanding the
roles of genes in these complex processes. The KEGG (Kyoto Encyclopedia
of Genes and Genomes) database [[58]19] stands as one of the foremost
publicly accessible resources for pathway-related data. To identify
significantly enriched metabolic and signal transduction pathways among
CAGs (Candidate-Associated Genes) relative to the entire genome
context, pathway enrichment analysis was performed. The method for
calculating enrichment is consistent with that employed in Gene
Ontology (GO) [[59]20] analysis:
[MATH:
P=1−<
mo
stretchy="true">∑i=0
m−1(Mi)(N−Mn−i)(
Nn) :MATH]
In this context, N signifies the total count of genes with KEGG
annotations, while n denotes the number of CAGs within N. M represents
the total number of genes annotated to particular pathways, and m is
the number of CAGs in M. Following the calculation of the p-value, it
was corrected using False Discovery Rate (FDR) adjustment, with an FDR
value of 0.05 or less being set as the threshold. Pathways that meet
this criterion are categorized as significantly enriched pathways in
CAGs. Finally, we utilize the String database to identify genes that
are significantly represented in pathways and create a protein–protein
interaction map. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and
Genomes (KEGG) analyses were performed using the OmicShare tools, a
free online platform for data analysis
([60]http://www.omicshare.com/tools, accessed on 9 January 2025).
2.10. Statistical Analysis
In the analysis of trait correlations, we used Pearson’s correlation
coefficient to quantify the linear relationships between different
traits. The Pearson’s correlation coefficient ranges from −1 to 1,
where values close to 1 indicate a strong positive correlation, values
close to −1 indicate a strong negative correlation, and values around 0
indicate no significant correlation. Statistical analyses were
performed using the SPSS 18.0 software package (SPSS Science, Chicago,
IL, USA). Experimental data were subjected to t-test and ANOVA
analyses, with a significance threshold set at p < 0.05. Graphs were
generated using GraphPad Prism 8 software (GraphPad, Santiago, MN,
USA). Data are presented as mean ± standard deviation (SD).
3. Results
3.1. Phenotypic Value Statistics of the Traits
During the phenotypic evaluation of buffalo. We carried out MY, PM, PY,
PP, FP, and FY phenotypic value analysis.
For milk production traits, the mean value of MY was recorded as 2321.3
kg, while the mean value of PM was measured as 11.4 kg. The mean value
of FY was recorded as 116.1 kg, while the mean value of PY was measured
as 80.8 kg. The average protein and fat percentages in our population
were 4.8% and 5.2%, respectively. The phenotypic data statistics are
presented in [61]Table 1.
Table 1.
Statistical description of lactation traits *.
Traits Mean SD Min Max
MY 2321.3 861.8 480.8 5185.3
PM 11.4 4.8 2.7 24.4
PY 80.8 41.2 4.0 275.8
FY 116.1 57.9 5.0 363.3
PP 4.8 0.4 4.4 5.8
FP 5.2 0.8 3.7 8.9
[62]Open in a new tab
* MY, milk yield; PM, peak milk yield; PY, protein yield; FY, fat
yield; PP, protein percentage; FP, fat percentage; SD, standard
deviation.
Through the correlation analysis of various lactation traits, it was
found that the correlation between total protein yield (PY) and total
milk fat yield (FY) is the highest, reaching above 0.9 ([63]Figure 1).
This strong positive correlation suggests a significant relationship
between the production of protein and fat in milk, indicating that
these two components tend to vary together in response to genetic or
environmental factors.
Figure 1.
[64]Figure 1
[65]Open in a new tab
Correlation analysis of various lactation traits. MY, milk yield; PM,
peak milk yield; PY, protein yield; FY, fat yield; PP, protein
percentage; FP, fat percentage.
In this study, a comprehensive analysis of genome-wide variations led
to the identification of 2,208,174 genetic markers. Among the detected
genetic markers, 2,012,270 were identified as SNPs and 195,904 were
classified as insertion–deletion (Indel) variants. Following stringent
filtering criteria, a refined set of 99,261 markers was retained,
comprising 93,494 SNPs and 5767 Indels.
3.2. Population Structure
Upon obtaining PCA scores, the samples under investigation can be
visualized via a scatter plot that utilizes the values of the first
three principal components as axes. Referring to [66]Figure 2, in
scatter plots [67]Figure 2A, it is evident that the majority of
individuals within herds MB and NB are distinctly isolated from one
another. In the scatter plot in [68]Figure 2B, we can observe
clustering, particularly in groups MB and ZB, where these two clusters
overlap and are closely grouped together in multiple instances. In
[69]Figure 2A,C, it is evident that the herds are broadly segregated
into three distinct clusters. One cluster predominantly consists of NB
herds, whereas the other two clusters are primarily composed of MB, ZB,
and DB herds, respectively.
Figure 2.
[70]Figure 2
[71]Open in a new tab
The sample clustering obtained from PCA through three two-dimensional
scatter plots, namely scatter (A), scatter (B), and scatter (C); scree
plot (D). The percentage of variance explained by each PC is noted in
parentheses. In the scatter plots, colored circles represent four
different groups: DB, MB, NB, and ZB correspond to 1 DB, 42 MBs, 31
NBs, and 46 ZBs, respectively.
To establish the optimal number of clusters (k), we used the Admixture
software and evaluated cross-validation error rates. The Admixture
algorithm performs model-based clustering and estimates the proportion
of ancestry from K ancestral populations. By minimizing the
cross-validation error rates, we identified the value of k = 3 that
best fits our data. [72]Figure 3 displays the line graph depicting the
cross-validation error rate.
Figure 3.
[73]Figure 3
[74]Open in a new tab
The line graph illustrating the cross-validation error rate is
depicted, with the number of sample clusters delineated along the
x-axis and the corresponding cross-validation error rate indicated on
the y-axis.
To simulate the population classification and genetic ancestry of each
sample across varying numbers of subgroups (K = 2–9), we utilized the
PopHelper software (v2.2.7) [[75]15] to generate bar plots illustrating
the genetic composition of each sample within every subgroup. The
results are presented in [76]Figure 4, where each color corresponds to
a distinct cluster for each K-value. From the line graph of the
cross-validation error rate, it is evident that the optimal number of
clusters is K = 3. Similarly, as observed in the bar graph ([77]Figure
3) depicting the genetic composition of the samples, when K = 3, it is
the optimal number of clusters for these 120 buffaloes. This finding is
consistent with the results obtained from the PCA. Consequently, we
conclude that these 120 buffaloes can be effectively divided into three
distinct subgroups.
Figure 4.
[78]Figure 4
[79]Open in a new tab
Genetic bar chart illustration for K-means clustering with varying
numbers of clusters (K = 2 to 9).
3.3. Results of the Genome-Wide Associations
Following the calculation of p-values for the SNP loci using a
Generalized Linear Model, we constructed a Manhattan plot and a Q-Q
plot, as presented in [80]Figure 5.
Figure 5.
[81]Figure 5
[82]Open in a new tab
Association analysis with milk production-related traits in water
buffalo was conducted using the GLM-Q approach. The traits investigated
include MY (A), PM (B), PY (C), FY (D), PP (E), and FP (F). The
Manhattan plot on the left, created using the qqman package,
illustrates the p-values for SNP markers across 25 chromosomes
(comprising 24 autosomes and 1 X chromosome). The blue line delineating
the Manhattan plot signifies the significance threshold, determined by
0.05/N (number of SNP). Markers that surpass this threshold are deemed
significant. The plot on the right is a Q-Q plot, where the x-axis
denotes the observed values of the markers, and the y-axis represents
the expected values, which have been transformed into the −10 log
scale.
The leftmost plot is a Manhattan plot, where the visually discernible
blue line, running parallel to the x-axis, serves as the critical
demarcation line. In the Manhattan plot, the points that rise above the
threshold line, which is the blue line paralleling the x-axis, signify
significant loci. Upon identifying the significant loci that surpass
the threshold line in the Manhattan plot, the subsequent step involves
documenting the relevant information pertaining to these significant
loci.
The plot on the right is the Q-Q plot, where the points in the bottom
left corner fall along the line, suggesting that the observed p-values
align closely with the expected values. The points exhibit a distinct
upward deviation from the diagonal in the upper right corner, which
signifies that the observed p-values exceed the anticipated values. The
presence of these points, which denote significant loci, across all
four plots underscores the validity of the analytical model, suggesting
its appropriateness in capturing the underlying patterns. Following
this, the genes situated within a 50 kb range of the significant loci
are carefully selected and earmarked as candidate genes.
In our GWAS, we have identified a large number of SNPs associated with
FY and PY ([83]Figure 5). We identified 56 statistically significant
SNPs and 54 candidate genes within a 50 Kb range surrounding these loci
that were associated with the traits MY, PY, PP, FP, and FY ([84]Table
2).
Table 2.
The SNPs identified via genome-wide association analysis encompass
detailed information regarding their chromosomal locations, p-values,
and associated candidate genes.
Traits SNP Chr Pos p R^2 Candidate Genes
MY 1 [85]NC_037552.1 110795896 4.3 × 10^−7 0.39 CNTNAP2
PY 2 [86]NC_037545.1 45287655 4.01 × 10^−7 0.36 --
PY 3 [87]NC_037545.1 45287667 4.01 × 10^−7 0.36 --
PY 4 [88]NC_037545.1 45287677 4.01 × 10^−7 0.36 --
PY 5 [89]NC_037545.1 45287689 4.01 × 10^−7 0.36 --
PY 6 [90]NC_037545.1 155842848 6.41 × 10^−7 0.43 KCNAB1
PY 7 [91]NC_037546.1 174939089 1.82 × 10^−7 0.37 TINAGL1; AZIN2
PY 8 [92]NC_037547.1 10261615 8.43 × 10^−8 0.35 RBFOX3
PY 9 [93]NC_037548.1 58828968 3.56 × 10^−7 0.33 NEDD1
PY 10 [94]NC_037548.1 97643871 9.64 × 10^−8 0.369 EEA1; PLEKHG7
PY 11 [95]NC_037549.1 52254091 4.08 × 10^−7 0.31 LEFTY2; PYCR2; -
PY 12 [96]NC_037549.1 122104312 8.972 × 10^−8 0.41 DOC2G; NUDT8; TBX10;
ALDH3B1; UNC93B1; ALDH3B1; NDUFS8; TCIRG1
PY 13 [97]NC_037550.1 7990287 2.51 × 10^−7 0.34 CFAP126; SDHC
PY 14 [98]NC_037550.1 109178670 1.5 × 10^−9 0.48 MAP7D1; TRAPPC3;
COL8A; ADPRHL2; TEKT2
PY 15 [99]NC_037551.1 62424439 7.20 × 10^−11 0.54 --
PY 16 [100]NC_037551.1 87385802 1.7 × 10^−9 0.43 CAMK2D
PY 17 [101]NC_037552.1 119759448 4.64 × 10^−7 0.30 USP17L13
PY 18 [102]NC_037556.1 9792358 3.18 × 10^−7 0.36 --
PY 19 [103]NC_037557.1 18868198 1.71 × 10^−7 0.34 ABCC4
PY 20 [104]NC_037557.1 18868200 1.72 × 10^−7 0.34 ABCC4
PY 21 [105]NC_037557.1 18868226 1.71 × 10^−7 0.34 ABCC4
PY 22 [106]NC_037557.1 18868442 2.02 × 10^−7 0.34 ABCC4
PY 23 [107]NC_037557.1 18868443 2.02 × 10^−7 0.34 ABCC4
PY 24 [108]NC_037557.1 18868449 2.02 × 10^−7 0.34 ABCC4
PY 25 [109]NC_037557.1 18868500 1.71 × 10^−7 0.34 ABCC4
PY 26 [110]NC_037557.1 18868507 1.71 × 10^−7 0.34 ABCC4
PY 27 [111]NC_037557.1 18868523 1.71 × 10^−7 0.34 ABCC4
PY 28 [112]NC_037560.1 28753781 9.90 × 10^−8 0.40 GUCY2D; LRRC32
PY 29 [113]NC_037560.1 35600620 3.36 × 10^−8 0.40 OR52Z1; OR51V1;
OR51V1; OR52A5; OR52K1; OR52K1
PY 30 [114]NC_037560.1 60745000 4.5 × 10^−10 0.44 TMPRSS5
PY 31 [115]NC_037560.1 74611905 1.48 × 10^−8 0.38 CNTN5
PY 32 [116]NC_037562.1 12469915 1.04 × 10^−8 0.43 --
PY 33 [117]NC_037564.1 56654985 8.12 × 10^−8 0.39 --
PY 34 [118]NC_037564.1 56655021 8.12 × 10^−8 0.39 --
PY 35 [119]NC_037565.1 13765148 1.19 × 10^−7 0.37 CTNNB1
PY 36 [120]NC_037566.1 61557795 1.04 × 10^−7 0.37 KCNG2; PQLC1; TXNL4A;
YVCT
PY 37 [121]NC_037567.1 17641416 1.31 × 10^−8 0.39 TLL2; TM9SF3
PY 38 [122]NC_037567.1 17641470 1.91 × 10^−8 0.39 TLL2; TM9SF3
PY 39 [123]NC_037567.1 17641671 1.31 × 10^−8 0.39 TLL2; TM9SF3
PY 40 [124]NC_037567.1 44281645 8.32 × 10^−8 0.34 --
FY 41 [125]NC_037550.1 109178670 8.94 × 10^−8 0.37 MAP7D1; TRAPPC3;
COL8A; ADPRHL2; TEKT2
FY 42 [126]NC_037551.1 62424439 1.7 × 10^−8 0.42 --
FY 43 [127]NC_037557.1 18651732 2.48 × 10^−7 0.32 ABCC4
FY 44 [128]NC_037560.1 60745000 4.44 × 10^−7 0.32 TMPRSS5
FY 45 [129]NC_037560.1 74050792 3.67 × 10^−7 0.36 --
PP 46 [130]NC_037556.1 50669172 3.34 × 10^−8 0.43 --
FP 47 [131]NC_037545.1 35877328 4.11 × 10^−7 0.30 --
FP 48 [132]NC_037546.1 6705158 3.44 × 10^−7 0.27 --
FP 49 [133]NC_037548.1 4405934 6.25 × 10^−8 0.33 FAM118A; UPK3A;
KIAA0930
FP 50 [134]NC_037552.1 24460 6.65 × 10^−8 0.29 --
FP 51 [135]NC_037552.1 112491377 3.27 × 10^−8 0.32 ZNF777; ZNF746
FP 52 [136]NC_037552.1 113607118 2.36 × 10^−8 0.33 ABCB8; ASIC3
FP 53 [137]NC_037555.1 29774524 2.45 × 10^−7 0.28 PRKCH
FP 54 [138]NC_037559.1 81684074 5.71 × 10^−8 0.32 DGAT1; HSF1
FP 55 [139]NC_037564.1 36725181 9.35 × 10^−8 0.31 LINGO1
FP 56 [140]NC_037569.1 5832499 2.09 × 10^−8 0.35 KAL1
[141]Open in a new tab
3.4. Kyoto Encyclopedia of Genes and Genomes Pathway Analysis of Candidate
Genes
GO and KEGG analyses were conducted to examine the functional pathways
for DEGs. The top 20 GO terms, classified by –log10(p-value), were
significantly enriched in candidate genes compared to the genome
background GO. The top 20 GO terms included eight biological processes
(BPs), a cellular component (CC) and one molecular function (MF). The
BP terms were enriched in growth (GO:004007), developmental processes
(GO:0032502), cellular processes (GO:0009987), metabolic processes
(GO:0008152), and cellular component organization or biogenesis
(GO:0071840) ([142]Figure 6A, [143]Supplementary Table S1).
Figure 6.
[144]Figure 6
[145]Open in a new tab
GO and KEGG analysis of candidate genes. (A) GO bar plot diagram
showing the top 20 enriched GO terms. GO categories, including cellular
component, biological process, and molecular function. (B) The
enrichment circle diagram shows the KEGG analysis of the top 20
pathways. Four circles from the outside to the inside. First circle:
the classification of enrichment; outside the circle is the scale of
the number of genes. Different colors represent different categories.
Second circle: number and p-values of the classification in the
background genes. The more genes, the longer the bars; the smaller the
value, the redder the color. Third circle: bar chart of the total
number of candidate genes. Fourth circle: rich factor value of each
classification (number of candidate genes in this classification
divided by the number of background genes). Each cell of the background
helper line represents 0.1, and the color coding signifies the
statistical significance of the corresponding enrichment.
The functional enrichment cycle diagram displays the top 20 KEGG
pathways ([146]Figure 6B, [147]Supplementary Tables S2 and S3),
classified by –log10 (p-value), which reveals that the candidate genes
were mainly enriched in five KEGG_A_class pathways, including
metabolism, environmental information processing, cellular processes,
organismal systems, and human diseases. Among these pathways,
environmental information processing included ABC transporters
(ko02010), the cAMP signaling pathway (ko04024), the TGF-β signaling
pathway (ko04350), and the wnt signaling pathway (ko04310); metabolism
included arginine and proline metabolism (ko00330) and phenylalanine
metabolism (ko00360); biosynthesis of amino acids (ko01230) and
metabolic pathways was involved (ko01100); and cellular processes
included signaling pathways regulating pluripotency of stem cells
(ko04550). In conclusion, our data suggest that the identified
candidate genes play a crucial role in regulating lactation
performance, particularly in terms of milk fat yield (FY), protein
percentage(PP), fat percentage (FP), and protein yield (PY), by
modulating the tgf-β signaling pathway, wnt signaling pathway,
metabolic pathways, and cAMP signaling pathway. These findings provide
valuable insights into the molecular mechanisms underlying dairy
production and could inform future breeding strategies to enhance milk
quality and quantity.
3.5. Significant Association of Milk Protein Content with SNP Validation
[148]Table 3 presents the results of individual genotyping for four key
loci in water buffalo: [149]NC_037557.1:18868198 (ABCC4),
[150]NC_037549.1:52254091 (LEFTY2), [151]NC_037559.1:81684074 (DGAT1),
and [152]NC_037551.1:87385802 (CAMK2D). The analysis reveals
significant differences in milk protein content across different
genotypes at these loci.
Table 3.
The results of individual genotyping *.
Candidate Genes SNP (Chr:Pos) Milk Protein Yield
Homozygous Mutation Heterozygous Mutation Reference Genotype
ABCC4 [153]NC_037557.1:18868198 A/A G/A G/G
73.46 ± 29.68C 103.97 ± 47.99B 144.57 ± 65.04A
LEFTY2 [154]NC_037549.1:52254091 A/A G/A G/G
73.87 ± 31.82B 97.07 ± 19.41B 151.03 ± 74.08A
DGAT1 [155]NC_037559.1:81684074 T/T C/T C/C
67.01 ± 38.53B 90.62 ± 45.36B 125.61 ± 71.25A
CAMK2D [156]NC_037551.1:87385802 C/C C/T T/T
62.03 ± 48.35C 95.85 ± 58.65B 133.24 ± 79.52A
[157]Open in a new tab
* The phenotypic values of milk protein content are expressed as “least
squares mean ± standard deviation”. Different letters in the same
column indicate significant differences (p < 0.05); the same letter or
no letter indicates no significant difference (p > 0.05). The genotypes
of SNP loci are arranged in the order of homozygous mutant,
heterozygous, and reference types.
For ABCC4 ([158]NC_037557.1:18868198), the A/A genotype is associated
with significantly lower milk protein yield compared to the G/A and G/G
genotypes.
For LEFTY2 ([159]NC_037549.1:52254091), both the A/A and G/A genotypes
exhibit significantly lower milk protein yield compared to the G/G
genotype.
For DGAT1 ([160]NC_037559.1:81684074), the T/T and C/T genotypes are
associated with significantly lower milk protein yield compared to the
C/C genotype.
For CAMK2D ([161]NC_037551.1:87385802), the T/T genotype shows
significantly lower milk protein yield compared to both the CT and C/C
genotypes.
These findings highlight the influence of specific genotypes on milk
protein yield, providing valuable insights for genetic selection and
breeding programs aimed at improving milk quality in water buffalo.
4. Discussion
4.1. Population Stratification
Population stratification is a pivotal factor in GWASs, as it
significantly influences the findings due to disparities in ancestral
origins. These differences can give rise to discrepancies in allele
frequencies across populations, potentially generating spurious
association signals [[162]21,[163]22]. To mitigate the risk of
false-positive findings in our analysis, it is imperative to
acknowledge the existence of population stratification. PCA is capable
of diminishing the complexity of a dataset while keeping its covariance
structure intact [[164]23]. To assess the classification effectiveness
of the first 10 principal components, we calculated the proportion of
variance explained by each component using the filtered SNP markers
with PLINK. This analysis helped us understand how well the principal
components capture the genetic variation within the population.
Furthermore, we used the Admixture software to perform K-cluster
analysis and evaluated the cross-validation error (CV-error) to
determine the optimal number of subpopulations. By minimizing the
CV-error, we identified the most appropriate value of K for population
stratification. Though these buffaloes are raised on the same farm,
their origins differ: the Nili-Ravi buffalo hails from Pakistan, while
the Murrah buffalo originates from India. From our PCA plot, we observe
clustering, particularly between groups MB and ZB, where these two
clusters overlap and coalesce in several instances. This indicates that
the buffaloes in these two groups may have a closer genetic
relationship compared to the other groups. Overall, these groups can be
generally categorized into two distinct segments. The plot depicting
cross-validation error suggests that the optimal value of K, which
corresponds to the lowest cross-validation error, is 3. Therefore, it
is concluded that these 120 water buffaloes should be divided into
three subgroups.
The simple linear model serves as a valuable instrument for conducting
SNP and phenotype analysis, concurrently managing population
stratification by incorporating the relevant population structure as a
covariate [[165]24]. Typically, the models used in GWAS analysis are
adjusted based on the genetic background and stratification of the
dataset. In general, after correcting for population stratification,
the inflation factor should approach a value of 1 in instances that
adhere to a normal distribution [[166]24]. In our experimental
outcomes, the inflation factor for MY was 1.056, that for PM was 1.1,
that for PY was 0909, that for PP was 1.06, that for FP was 1.12, and
that for total FY was 0.916. These findings suggest that our model’s
adjustment for population stratification is valid.
4.2. Genome-Wide Association Analysis of Milk Production-Related Traits
Enhancing milk production and quality in water buffalo has emerged as a
critical research priority within the dairy industry. Water buffalo
play a vital role in global dairy production, particularly in regions
where they are the primary milk source. However, there is a significant
research gap in understanding the lactation traits of water buffalo.
Unlike Holstein cows, which have been extensively studied through
detailed production records and numerous genome-wide association
studies (GWASs), water buffalo have received considerably less
attention in these areas. This disparity in research means that key
aspects of water buffalo genetics, such as the identification of
candidate genes and pathways influencing milk yield, protein content,
and fat composition, remain underexplored. As a result, breeding
programs for water buffalo are not as advanced or effective as those
for other dairy species, limiting the potential for improving milk
production and quality. Addressing this research gap is essential for
developing targeted breeding strategies and enhancing the overall
productivity of water buffalo dairy operations.
For milk production traits, this investigation assessed and quantified
the MY as well as the PM. The mean values were 2321.3 kg and 11.4 kg,
respectively. This investigation also assessed and quantified the PY,
PP, FP, and FY. The mean values were 80.8 kg, 4.8%, 5.2%, and 116.1 kg,
respectively. The mean milk yield (MY) of our buffalo population is
lower than that of Mediterranean buffaloes (2321.3 kg) but higher than
that of Brazilian buffaloes (1578.90 kg). In contrast, the mean values
for peak milk yield (PM), fat yield (FY), and protein yield (PY) are
more closely aligned with those of Mediterranean buffaloes
[[167]25,[168]26]. Therefore, we are confident that our production
record measurements are within the normal range and can be converted
into corresponding phenotype vectors in the model. We carried out an
association analysis for each of the phenotypic traits.
In the association analysis of PY and FY, we identified a total of 54
candidate genes. Lactation is a highly complex process involving the
coordinated action of multiple signaling pathways. Among the identified
genes, several are associated with key pathways, including the cAMP
signaling pathway (ABCC4), TGF-β signaling pathway (LEFTY2), Wnt
signaling pathway (CAMK2D), and metabolic pathways (DGAT1).
The ABCC4 gene, located on chromosome [169]NC_037557.1, encodes
ATP-binding cassette (ABC) transporters. These proteins are essential
for modulating platelet aggregation, a critical process in blood
clotting and hemostasis [[170]27]. Moreover, ABCC4 has been detected in
milk from the early stages of lactation [[171]28]. Our findings
indicate that ABCC4 may regulate milk production by influencing the
cAMP signaling pathway, suggesting its potential importance in
lactation biology.
The TGF-β signaling pathway (LEFTY2) plays a pivotal role in mammary
gland development and lactation [[172]29,[173]30,[174]31]. Studies have
shown that the TGF-β pathway not only regulates the proliferation and
differentiation of mammary epithelial cells but also influences the
synthesis and secretion of milk components. Specifically, TGF-β
promotes the self-renewal of mammary stem cells while inhibiting
excessive cell proliferation, ensuring the proper development and
function of mammary tissue.
The Wnt signaling pathway is a complex regulatory network that plays a
crucial role in controlling cell growth, differentiation, and tissue
development. In particular, breast development is intricately linked to
lactation performance, as the proper formation and function of the
mammary gland are essential for efficient milk production. Extensive
previous research has shown that the gene expression and epigenetic
regulation of CAMK2D (Calcium/Calmodulin-Dependent Protein Kinase II
Delta) significantly impact the physiological functions of the mammary
gland. CAMK2D is involved in various cellular processes, including
calcium signaling and gene transcription, which are critical for
mammary gland development and lactation [[175]32,[176]33].
Milk fat provides essential fatty acids (FAs) that can contribute to
various health benefits, such as supporting cardiovascular health,
enhancing nutrient absorption, and promoting overall well-being,
depending on their specific composition. Previous research has
demonstrated that polymorphisms in the DGAT1 (Diacylglycerol
O-Acyltransferase 1) gene influence various milk production traits,
including milk yield, fat yield, protein yield, and the fat and protein
content of milk [[177]34,[178]35,[179]36]. Consistent with these
findings, our study revealed that DGAT1 polymorphisms have a
significant impact on fat content, further highlighting the importance
of this gene in determining milk composition.
4.3. The Mechanism of SNP Mutation and the Milk Production Traits
Among the selected SNPs, none are non-synonymous. However, we
identified base mutations in non-coding regions, specifically at
positions [180]NC_037557.1:18868198, [181]NC_037549.1:52254091,
[182]NC_037559.1:81684074, and [183]NC_037551.1:87385802, which may
potentially influence milk production-related traits. These results
suggest that variation in non-coding regions may play an important role
in the regulation of these important production and quality traits. The
variability of non-transcriptional regulatory sequences (e.g.,
promoters, enhancers, CTCF sites) is closely related to the mechanism
of non-coding variation in cell development, but whether this rule
applies to milk production traits needs further verification
[[184]37,[185]38].
5. Conclusions
In this investigation, genome-wide association analysis was performed
on four milk production-related characteristics in domestic water
buffaloes, which included MY, PM, PY, and FY. A total of 56 significant
SNP loci were identified, and 54 candidate genes within a 50 Kb range
surrounding these loci were selected. These candidate genes were
enriched in biological processes such as the cAMP signaling pathway
(ABCC4), the TGF-β signaling pathway (LEFTY2), the Wnt signaling
pathway (CAMK2D), and metabolic pathways (DGAT1), all of which are
directly or indirectly involved in the lactation process. These
findings offer a reference framework for comprehending the genetic
architecture underlying milk production and quality attributes in water
buffaloes, thereby paving the way for subsequent biological validation
of the implicated genes. This is crucial guidance for breeding and
improvement programs in water buffaloes. In future studies, we will
focus on elucidating the relationships between candidate genes and
metabolites involved in the lactation process to uncover the metabolic
pathways and mechanisms through which these genes influence milk
production and composition.
Acknowledgments