Abstract Water buffalo (Bubalus bubalis) are an important animal resource that contributes milk, meat, leather, dairy products, and power for plowing and transport. However, mastitis, a bacterial disease affecting milk production and reproduction efficiency, is most prevalent in populations having intensive selection for higher milk yield, especially where the inbreeding level is also high. Climate change and poor hygiene management practices further complicate the issue. The management of this disease faces major challenges, like antibiotic resistance, maximum residue level, horizontal gene transfer, and limited success in resistance breeding. Bovine mastitis genome wide association studies have had limited success due to breed differences, sample sizes, and minor allele frequency, lowering the power to detect the diseases associated with SNPs. In this work, we focused on the application of targeted gene panels (TGPs) in screening for candidate gene association analysis, and how this approach overcomes the limitation of genome wide association studies. This work will facilitate the targeted sequencing of buffalo genomic regions with high depth coverage required to mine the extremely rare variants potentially associated with buffalo mastitis. Although the whole genome assembly of water buffalo is available, neither mastitis genes are predicted nor TGP in the form of web-genomic resources are available for future variant mining and association studies. Out of the 129 mastitis associated genes of cattle, 101 were completely mapped on the buffalo genome to make TGP. This further helped in identifying rare variants in water buffalo. Eighty-five genes were validated in the buffalo gene expression atlas, with the RNA-Seq data of 50 tissues. The functions of 97 genes were predicted, revealing 225 pathways. The mastitis proteins were used for protein-protein interaction network analysis to obtain additional cross-talking proteins. A total of 1,306 SNPs and 152 indels were identified from 101 genes. Water Buffalo-MSTdb was developed with 3-tier architecture to retrieve mastitis associated genes having genomic coordinates with chromosomal details for TGP sequencing for mining of minor alleles for further association studies. Lastly, a web-genomic resource was made available to mine variants of targeted gene panels in buffalo for mastitis resistance breeding in an endeavor to ensure improved productivity and the reproductive efficiency of water buffalo. Keywords: GWAS, mammary gland, markers, mastitis, TGP, water buffalo Introduction Water buffalo (Bubalus bubalis) have been domesticated for more than 5,000 years, are also known as “Black Gold” due to their economic value. Its contribution to the Indian GDP in terms of milk, meat, horns, hides, leather, dairy products like cream, butter, yogurt, and cheese, power, plowing, and transporting people and crops have made buffalo an important animal resource in rural areas. India ranks first in world milk production due to the largest bovine population, where buffalo contributes 55%. The estimated world population of water buffalo is 208 million which is spread across over 77 countries in five continents with a major population (96%) in Asia ([45]1). Water buffalo is an efficient converter of poor-quality forages into high quality milk and meat, making it a very valuable genetic resource for countries having “Low External-Input System” ([46]2). Its efficiency of rumen fermentation ([47]3) and nitrogen utilization ([48]4) ability make it the animal of preference. However, intense selection for dairy productivity often results in increased levels of inbreeding, due to the limited number of bulls in the gene pool. Furthermore, climate change with a rising temperature humidity index (THI) has induced mastitis and many other diseases in major dairy animals including buffalo like retained placenta, metritis, ovarian cysts, claw diseases, milk fever, ketosis, and displaced abomasum ([49]5–[50]7). Mastitis infections cause inflammation of the mammary gland and udder tissue, which is one of the most complex, costlier endemic diseases of dairy animals. It occurs as an immune response to invasion by a group of bacterial species in the teat canal. This may also occur due to chemical, mechanical, or thermal injury to the udder. This disease causes huge economic loss due to lesser milk production, premature culling, veterinary costs, and the risk of drug residues ([51]8). Thus, there is a need to reduce mastitis to increase profitability and health ([52]9). Reduced milk production in buffalo mastitis costs US$8.80 per lactation ([53]10). The direct loss happens due to a reduction in milk yield (70%), milk discard after treatment (9%), cost of veterinary services (7%), and premature culling (14%). There is also an indirect loss by adverse reproductive efficiency because of an imbalance in luteinizing hormone (LH) and estradiol that leads to failure of ovulation ([54]11). In India, a total annual loss due to mastitis (clinical and subclinical) in buffalo has been estimated at US$526 million ([55]10). Both clinical and subclinical mastitis is associated with mammary inflammation, a difference seen in clinical detectable and undetectable changes, respectively. With clinical mastitis, clinical visible signs are accompanied in the milk or mammary parenchyma. In subclinical mastitis conditions, a specific diagnostic test is performed ([56]12). Antibiotic treatment is effective in only 60% of cases of mastitis, which is due to an increase in b-lactamase producing organisms ([57]10). The indiscriminate use of antibiotics has also led to a rise in resistant bacterial strains in Indian buffalo ([58]13). All these are of global concern which is highly alarming due to the rise in AMR (antimicrobial resistance) and HGT (Horizontal Gene Transfer) of ARG (Antibiotic Resistance Genes) ([59]14). The traditional breeding approach for the reduction of mastitis has not been effective due to its low heritability and its unfavorable genetic correlation with milk yield ([60]15). Recent studies on GWAS of mastitis disease in bovines have revealed that breed difference, sample size, and minor allele frequency, may lower the power to detect the disease-associated SNPs ([61]16). Though universal NGS array based GWAS and genomic selection have been successful in animal breeding for better genetic gain of productivity traits, mastitis resistance breeding has still had limited success. GWAS has a major limitation in that it implicates the entire genome in disease predisposition where most association signals reflect variants/genes with no direct biological relevance to the disease ([62]17). A control group of animals may contain unexposed individuals with susceptible genotypes. In such situations, a candidate gene approach tends to have greater statistical power than GWAS ([63]18). GWAS can miss data on intron retention, which works in disease resistance, as reported in water buffalo ([64]19). The success of the candidate gene approach depends on the correct choice of targeted gene panel (TGP) for SNP discovery, focusing on disease associated pathways in the investigation ([65]20). In genome assisted selection, the candidate gene approach using TGP and GWAS complement each other, rather than being contradictory. TGP can be used pragmatically in dissecting the genetics of complex disease by the mining of extremely low frequency alleles required in case-control association analysis. This ensures that minor novel alleles are not missed, which are not present in the prefabricated SNP array/chip ([66]21, [67]22). TGP approach offers multifold advantages like sequencing at higher depth (500–1000X coverage) than the whole genome/exome sequencing, allowing the discovery of extremely rare variants which are potentially the most valuable alleles for association studies ([68]23, [69]24). It is rapid and cost effective, especially in the case of association studies where the sample size is limited and mining of causative mutations in a single assay is imperative ([70]25). Variant mining of TGP can be done either by direct amplicon sequencing (if TGP <50) or by target enrichment by magnetic beads followed by NGS data generation ([71]26). Post-GWAS, prioritization of candidate genes for mastitis resistance breeding in cattle has been reported ([72]27). A computational approach has been reported to identify the candidate genes for disease association studies ([73]28). A genomic resource database for cattle candidate genes and genetic markers for milk production and mastitis has also been developed ([74]29). However, there is no targeted gene panel genomic resource of water buffalo for mastitis association studies. Recently available WGA with chromosome wise genomic data can be used for the development of genomic resources cataloging TGP. Since taurine and babuline genomes have extensively conserved milk and mastitis associated genes, comparative genomics approaches can quickly enrich buffalo genomic resource development. The present work aims to develop a targeted gene panel of mastitis genes in water buffalo. It anticipates the prediction of mastitis associated candidate genes in the water buffalo genome using publically available bovine mastitis genes. It also aims to validate the RNA-Seq library in water buffalo, along with predicting biochemical pathways mediating this disease through the protein-protein network analysis. We present a web-genomic resource that is required for variant mining for future association studies of mastitis disease. Materials and Methods Mining and Mapping of Bovine Mastitis Associated Genes An intensive literature search was carried out covering >40 literature across water buffalo (Bubalus bubalis) and cattle (Bos taurus) genes. Distinct terms associated with mastitis, namely, mastitis resistance, tolerance, and traits association were used as keywords for this search. These records were compiled and filtered to remove duplicates to finally fetch the genes associated with mastitis for further analyses in the present study. Earlier studies on buffalo mastitis genes were reported even before the availability of the de novo genome assembly but then chromosomal number and position in the physical map were not known. By applying the NGS-based variant mining from TGP in a given buffalo population, the targeted genes or region of interest can be mapped. To accomplish this, the entire set of genes reported in Bubalus bubalis was downloaded from NCBI ([75]30). Prior to further analyses, these were subjected to manual curation based on the organism specific term “Bubalus bubalis.” The finally filtered genes were used to map mastitis associated genes to obtain their genomic coordinates in the chromosome-wise genome assembly. In-house Perl scripts were used to map, based on gene symbol and its alias. Based on these parameters, mapping was accomplished for genes indicating features like gene symbol, gene ID, description, chromosome number, chromosome start and end location, strand orientation, and exon count. Validation of Predicted TGP Genes in the Water Buffalo Gene Expression Atlas Since genes in TGP panels are computationally predicted using cattle genome, their expressional reliability was validated in the RNA-Seq data of water buffalo available in the gene expression atlas of 50 different tissues ([76]31). This validation was done using an in-house generated Perl script. Functional Prediction of Mastitis Associated Candidate Genes The function and pathway enrichment analysis of genes were performed using PANTHER ([77]32) and DAVID v6.8 (Database for Annotation, Visualization, and Integrated Discovery) available at [78]https://david.ncifcrf.gov/ ([79]33, [80]34). It utilizes the Kyoto Encyclopedia of Genes and Genomes (KEGG) for pathway analysis and Gene Ontology (GO) analyses for functional prediction. Annotation tool DAVID had Bubalus bubalis specific 91 genes. Since, in bovines, mastitis-associated genes are highly conserved, further analysis was done by taking advantage of this using cattle (Bos taurus) in PANTHER ([81]http://pantherdb.org/data/). KEGG Mapper (Version 4.1), a mapping tool was used to reconstruct pathways ([82]35). KEGG Orthology (KO) annotation was retrieved for key candidate genes and pathway reconstruction was performed specific to the Bubalus bubalis species. Prediction of Cross-Talking Proteins by Protein-Protein Network Analysis The list of genome-based candidate genes was further enriched by computational prediction of mastitis disease associated proteins (DAP) using protein-protein interaction (PPI) analysis. The amino acid sequences of mastitis associated genes were retrieved from NCBI, using advanced search options with the keywords “mastitis” and organism “Bubalus bubalis” ([83]36). The cross talk of candidate gene proteins with additional predicted putative candidate gene proteins were predicted computationally by PPI analysis using the STRING database (Version 11.0; [84]https://string-db.org/) ([85]37). This database scores and integrates the various sources of PPI after collecting these from the public domain and complements the information with computational predictions. STRING database aims at a comprehensive and objective global network, considering both, physical as well as functional interactions. Proteins are represented as nodes and the interactions among them are pictured as edges in the network. Furthermore, Cytoscape (Version 3.7.1) ([86]38) was used to visualize the PPI network. To get the most inter-connected nodes from the constructed PPI network complex, additional Cytoscape plugins (MCODE and CentiScape) were used. To predict the extended interaction among proteins, the retrieved number of proteins was increased up to 100. All these interactions were predicted specific to Bubalus bubalis species. Evaluation of Mastitis TGP Based Variant Mining (SNP and Indel) RNA-Seq data of mammary gland tissues (ERR315636) and Refseq of water buffalo (GCA_003121395.1) were downloaded from NCBI. The data was processed by FastQC ([87]39) to check the quality. After the quality was checked low-quality reads and adapters were removed using the Trimmomatic tool V. 0.36 ([88]40) by selecting the parameters of Phred quality score ≥20, Minimum reads length = 25 and Leading:4, Trailing:4 and Sliding-window:4:20. Filtered pair-end reads were aligned to the reference (genome assembly of water buffalo) using Burrows-Wheeler Aligner (bwa-0.7.17) tool ([89]41, [90]42). The Sequence Alignment Map (SAM) file was converted to Binary Alignment Map (BAM) using SAMtools-1.9 ([91]43) followed by sorting of BAM file. Variant calling analysis for analyzing genotype likelihoods was done with mpileup function with BCFtools 1.9 ([92]44). SNPs were filtered at p < 0.05, read depth (d) ≥10, quality depth (Q) ≥30, MQ (minimum root mean square mapping quality) ≥40, and flanking sequence length 50. Genomic coordinates of SNPs and indels from the VCF file were mapped with coordinates of 101 genes using in-house generated shell scripting. Development of Web Resource: Water Buffalo-MSTdb (WBMSTDb) Water Buffalo Mastitis Database (WBMSTDb) is an online relational database that catalogs information of bovine mastitis associated genes, their information along with the functions and annotations, predicted genes and their functions by close and other interaction PPI networks and in-silico SNPs and indels. The web resource also houses various pathways where these genes are known to be involved. It has been developed using “three tier architecture,” namely, a client tier, middle tier, and database tier ([93]Figure 1). Client-end assists to display the information related to services available and build the front-end layer of the application for end-users of their customized search related to mastitis in buffalo. The database layer has been developed using PHP (Hypertext Preprocessor) which is an open-source server-side scripting language. The application layer, also known as the middle tier or logic tier takes the information from the presentation tier and controls the application's core functionality. Database tier has the information related to mastitis associated genes, gene pathways, proteins, and PPI, functions, and variants, etc. cataloged in MySQL database using SQL query language. WBMSTDb has been developed using LAMP (Linux-Apache-MySQL-PHP) technology and launched at Apache server. Figure 1. [94]Figure 1 [95]Open in a new tab Three tier architecture representing client, middle and database tier of WBMSTDb. Results Mining and Mapping of Bovine Mastitis Associated Genes A total of 159 mastitis associated genes were sourced from literature, taking water buffalo (Bubalus bubalis) and cattle (Bos taurus) as reference. After removing the duplicates and alternative gene alias, 129 genes associated with mastitis were used for further study. Out of 129 unique genes associated with mastitis, 120 genes were retrieved from highly diverse cattle breeds like Canadian Holsteins, Holstein, Holstein Friesian, and Brown Swiss, Chinese Holstein, Sanhe cattle, Chinese Simmental, Luxi Yellow, and Bohai Black. The remaining 9 genes were reported both in cattle breeds mentioned and in buffalo breeds like Murrah and Egyptian buffalo. A total number of 34,139 genes were retrieved from Bubalus bubalis from NCBI ([96]30) for mapping. After manual curation, 34,067 genes (98.41%) were obtained, which were further used for mapping 129 mastitis associated genes. Of these, 101 genes were successfully mapped. [97]Supplementary Table 1 shows details including the mapping location (chromosome coordinates), chromosome number, strand orientation, the total number of exons, gene type, organism based on the literature survey of these genes, whether the information was retrieved from cattle or buffalo studies along with references. Chromosome-wise