Abstract One of the primary factors contributing to boar taint is the level of androstenone in porcine adipose tissues. A majority of the studies performed to identify candidate biomarkers for the synthesis of androstenone in testis tissues follow a reductionist approach, identifying and studying the effect of biomarkers individually. Although these studies provide detailed information on individual biomarkers, a global picture of changes in metabolic pathways that lead to the difference in androstenone synthesis is still missing. The aim of this work was to identify major pathways and interactions influencing steroid hormone synthesis and androstenone biosynthesis using an integrative approach to provide a bird’s eye view of the factors causing difference in steroidogenesis and androstenone biosynthesis. For this purpose, we followed an analysis procedure merging together gene expression data from boars with divergent levels of androstenone and pathway mapping and interaction network retrieved from KEGG database. The interaction networks were weighted with Pearson correlation coefficients calculated from gene expression data and significant interactions and enriched pathways were identified based on these networks. The results show that 1,023 interactions were significant for high and low androstenone animals and that a total of 92 pathways were enriched for significant interactions. Although published articles show that a number of these enriched pathways were activated as a result of downstream signaling of steroid hormones, we speculate that the significant interactions in pathways such as glutathione metabolism, sphingolipid metabolism, fatty acid metabolism and significant interactions in cAMP-PKA/PKC signaling might be the key factors determining the difference in steroidogenesis and androstenone biosynthesis between boars with divergent androstenone levels in our study. The results and assumptions presented in this study are from an in-silico analysis done at the gene expression level and further laboratory experiments at genomic, proteomic or metabolomic level are necessary to validate these findings. Introduction Androgens, the primary hormones secreted by testis control and regulate the development of male accessory reproductive organs and secondary sexual characteristics. In porcine genomics, the special importance given to studying the synthesis and degradation of androgens is mainly due to an androgenic pheromone called androstenone. The accumulation of androstenone in boar adipose tissues is one of the major factors contributing to boar taint [39][1]. Boar taint is described as an off odor or off taste of meat derived from non castrated male pigs. At present, in many countries, surgical castration of boars is the primary method to reduce boar taint in pork meat [40][2]. Since piglet castration without anesthesia is going to be banned in the European Community by 2018 due to animal welfare reasons [41][3], there is an immediate need to develop non surgical methods to reduce boar taint mainly by regulating the synthesis of androstenone. The two proposed non surgical methods to reduce boar taint are: (i) the use of chemicals or drugs to reduce boar taint [42][4] and (ii) breeding for favorable characteristics to reduce boar taint [43][5]. In this regard, it should be noted that the European Food Safety Authority (EFSA) has already expressed concerns over consumer perception of meats from animals treated with chemicals and drugs to reduce boar taint [44][6]. Consumer perception issues over the use of chemicals and drugs for boar taint reduction leaves breeding as a more sustainable and accepted method to adopt for reducing boar taint. In order to select favorable biomarkers for breeding pigs with low androstenone levels and hence reduce boar taint, it is crucial to understand the genetic machinery behind the synthesis of androstenone. Androstenone is synthesized in testes and metabolized in liver. Although studying genes, interactions and pathways in both testes and liver is essential to understand the entire androstenone metabolic processes, in this study the major focus points are the factors that could contribute to androstenone synthesis in testis tissues. The synthesis of androstenone from pregnenolone in testes is mainly catalyzed by the enzymes cytochrome P450C17 (CYP17A1) and cytochrome B5 (CYB5) along with other reductases. The enzyme 5α reductases (ST5AR) catalyze the final step in the synthesis of androstenone [45][7]. At this point, it should be taken into account that although a number of enzymes catalyzing various steps in androstenone synthesis have been identified, the entire metabolic processes involved in the synthesis of androstenone has not been understood completely. Nevertheless, several studies have been performed to identify candidate biomarkers related to the synthesis of androstenone in testes. A study focusing on genetic correlations of backfat with direct and associative effects for androstenone has found that direct effects had a genetic correlation of 0.14±0.08 and associative effects had a genetic correlation of −0.25±0.18 [46][8]. High throughput microarray studies have been conducted to study the difference in gene expression patterns between testis samples from boars with divergent androstenone levels [47][9], [48][10]. Additionally, several QTL (Quantitative trait loci) studies and GWAS (genome wide association studies) have also been done to identify candidate QTL regions and polymorphisms responsible for varying levels of androstenone [49][11]–[50][16]. An in-house study using data from RNA-seq technology has also been performed recently to identify candidate biomarkers for varying levels of androstenone in porcine testes and liver samples [51][17]. All these studies have been successful in identifying several candidate QTL regions, genes and polymorphisms as potential candidate biomarkers to pursue further detailed investigations. A general trend among these aforementioned studies is that the candidate biomarkers identified in these studies are mainly analyzed and explained individually using a reductionist approach. Although individual analysis of candidate biomarkers using a reductionist approach helps to study their functions in detail, a phenotype or a disease is seldom the consequence of a change in a single effector gene or gene product, but rather the result of a multitude of changes in a complex interaction network [52][18]. From this point of view, integrative approaches merging different data sources with gene expression profiles would be more suited to gain a better understanding of a complex trait such as androstenone. In human development and medicine, integrative analysis approaches merging gene expression profiles with pathway data or interaction network has been shown to be a powerful approach to understand the disease. The usual end result of such methods are diagnostic pathways or disease subnetworks, which are demonstrated to enhance the prediction accuracy of disease states and to be more reproducible than single genes [53][19]. In this work, we have followed an integrative analysis procedure by merging together interaction network and pathway information from KEGG pathway database along with gene expression data. A current limitation of this approach in terms of studying androstenone metabolism is that none of the major pathway databases contain data on metabolic reaction steps or gene interactions involved in androstenone biosynthesis. As a work around to this limitation, we have treated androstenone biosynthesis as an offshoot of steroid hormone (testosterone) synthesis pathway in testis under the assumption that the pathways and interaction events that affect steroid hormone biosynthesis could also affect androstenone biosynthesis. The major aim of this work was to identify and study the major metabolic pathways and interactions involved in the maintenance and regulation of steroidogenesis and androstenone biosynthesis using gene expression data from porcine testis samples with divergent levels of androstenone measurement through an integrative analysis approach. Materials and Methods Materials Expression data The expression dataset used in this study is from a previous in-house RNA seq experiment conducted in order to understand the genetic mechanism behind androstenone metabolism [54][17]. For the purposes of this work, we used only the ten testis samples from the original study. In the original study, these ten testis samples were selected from a pool of 100 boars. In this pool of animals, boars with a fat androsteone level of 0.5 µg/g or less were defined as low androstenone (LA) animals and boars with a fat androstenone concentration of 1.00 µg/g or more were defined as high androstenone (HA) animals. From this population, 5 animals with extreme high and low levels of androstenone were selected as sample LA and HA population. The average androstenone measurement of LA sample animals was 0.24±0.06 µg/g and the average androstenone measurement of HA sample animals was 2.48±0.56 µg/g. Among these 10 animals, two sets of 3 animals each: 1 LA and 2 HA animals in the first set and 2 LA and 1 HA animals in the second set were half siblings. Additional details of sample collection, library preparation and sequencing are available in [55][17]. Pathway and network data We retrieved pathway and interaction network data from KEGG database (Release 60.0). This interaction network was comprised of enzyme - enzyme (reaction steps) and protein - protein interactions mapped to the corresponding porcine gene identifiers and annotated with KEGG pathway names and identifiers in which the interactions occur. The interaction network consisted of 23,198 edges (interactions) between 3,510 nodes (genes) mapped to 197 pathways. Methods Expression data quality control, mapping and normalization The first step in expression data analysis was a quality control and filtering step. In this step, PCR primers and bad quality sequences (Phred score <20) identified in the raw reads using FASTQC quality control application [56][20] were trimmed off. The selection of threshold cut-off (Phred score >20) was arbitrary and yet this cut-off threshold ensured that only the reads with a base quality score of 99% or more were retained for further analysis. The filtered raw reads were mapped to latest Sus scrofa genome build, Sscrofa10.2 from NCBI using a “splice aware” mapping algorithm TopHat [57][21] to generate individual genome mapping files for each sample. The expression set (expression matrix) was created by calculating read counts (expression values) for each gene from these genome mapping files using BEDTools [58][22]. It has been shown that the read count expression data set generated from an RNA-seq experiment follows a negative binomial distribution [59][23], but the classical linear modeling analysis procedures developed for microarray data sets assumes the data to be normally distributed. Although various non parametric procedures (distribution free methods) can be used in this context, we found that the results given by such analysis procedures were statistically non significant, owing to the small sample size of our data set and the limited power of non parametric methods to draw significant conclusions from data sets with small sample sizes. Recently, Law et al. [60][24] proposed applying normal distribution based microarray like statistical analysis methods to RNA-seq read count data. In order to overcome the limitations of small sample sizes and non parametric methods to an extend and also following the proposed idea in [61][24] of using normal distribution based microarray like statistical analysis methods to RNA-seq read count data, we normalized and log2 transformed our expression data set using “voom” function implemented in limma R package [62][25]. Comparison of various normalization and differential expression analysis methods for RNA-seq data have shown that voom normalization combined with limma package is relatively unaffected by outliers and performed well under many conditions [63][26]. An additional study [64][27] concluded that modeling RNA-seq gene count data as log normal distribution with appropriate pseudo counts (limma voom modeling) is a reasonable approximation of the data. Mean-variance modeling at the observational level (voom) estimates mean-variance relationship in the read count data and computes weights for each observation based on this relationship [65][24]. Our expression dataset was generated and normalized based on the above mentioned procedure. Identification of significant interactions Since we intended to identify significant interactions based on information from expression data and pathway interaction network, the very first step after quality control and normalization of expression data set was to trim the expression data set for genes in pathway interaction network. There were a total of 2,871 genes in common between both the transformed expression dataset and the interaction network from KEGG database. The trimmed interaction network had 23,198 edges (interactions) between 2,871 nodes (genes). In the next step, we calculated Pearson Correlation Co-efficient (PCC) of gene expression values for both LA and HA expression sets separately and by using these correlation coefficients as edge weight values, we generated two edge weighted interaction networks, (i) LA network where LA correlation coefficients were used as edge weight values and (ii) HA network where HA correlation coefficients were used as edge weight values. At this step, only those correlations with an edge in the interaction network were considered and the remaining correlation coefficients, interactions and genes were excluded from further analysis. Both LA and HA correlation coefficient weighted interaction networks contained 2,871 nodes (genes) and 15,960 edges (interactions) respectively. We termed the interaction network weighted with correlation coefficients from LA samples as “LA network” and the one weighted with correlation coefficients from HA samples as “HA network”. In order to identify the interactions that are significantly different between both LA and HA networks, the edge weights (correlation coefficients) of both the networks were transformed to z-score using Fisher-r-to-z transformation based on the equation: graphic file with name pone.0091077.e001.jpg (1) where r is the PCC. Following the calculation of z-scores for interactions in both networks, the differences between the z-scores were also calculated. For an edge z-score in LA interaction network, the corresponding edge z-score from HA interaction network was retrieved and the difference between the z-scores was calculated as: graphic file with name pone.0091077.e002.jpg (2) In order to identify significant z-score[DIFF] (and there by significant interactions),we followed a two step evaluation criteria based on random sampling and permutation approach [66][28]. Permutation and resampling based methods for estimating significance thresholds have already been used in high throughput studies [67][29], [68][30]. The evaluation criteria used in this study were: (i) zscore[DIFF] should be significant at a threshold of empirical p-value <0.05 against a set of z-scores randomly generated from the original expression data and (ii) at least one of the correlation coefficients used to calculate zscore[DIFF] (in [69]equation 2) should be significant at a threshold of empirical p-value <0.05 against a set of correlation coefficients randomly generated from the original expression data. For generating the set of z-scores in evaluation criteria (i) the first step was to generate a random expression matrix by randomly shuffling and reassigning the expression values into two sample groups. By this random shuffling and reassigning, we aimed to break up the original ordering and classification of the expression values and generate two complete random expression matrices and artificially replicate a set of z-scores calculated from a random population. Pearson correlation coefficients, random z-scores and z-score differences were calculated on these random expression matrices following the steps described previously. This entire process was repeated 10,000 times to generate a set of random z-score differences (zscore[RAND]) for each interaction. Significance threshold empirical p-value for each zscore[DIFF] was calculated as: graphic file with name pone.0091077.e003.jpg (3) where N = 10,000. A similar procedure was followed for calculating significance threshold empirical p-value for correlations in evaluation criteria (ii), where empirical p-values were calculated between correlation coefficients from randomly sampled expression data (random population correlation coefficients) and the original correlation coefficients from LA or HA datasets. Selecting only significant zscore[DIFF]s for further analysis would imply that we were selecting only those gene – gene interactions with a significant difference between LA and HA z-scores when compared to the set of random population z-scores. By adding the additional criterion that at least one of the correlation coefficients used to calculate the zscore[DIFF] should be significantly different from the set of random population correlation coefficients, we ensured that the selected gene – gene interactions had not only significant zscore[DIFF]s but also at least one significant correlation coefficient when compared to the random population data. We termed these selected interactions as “significant interactions”, since zscore[DIFF] defined for these interactions (edges) and at least one of the correlation coefficients used to calculate zscore[DIFF] were significant with respect to random population data. Once the identification of significant interaction was complete, we further classified these significant interactions into 8 interaction types such as: HA positive, HA positive significance, HA negative, HA negative significance, LA positive, LA positive significance, LA negative and LA negative significance. The rules used for classification of these interaction types and edge colors and line styles used in visualization of these interaction types are given in [70]Table 1. These classification rules were mainly used in the visualization step, and all the interaction networks in this work were visualized using Cytoscape [71][31]. [72]Figure S1 shows a schematic diagram of the entire workflow used in this analysis. Table 1. Interaction edge classification rules. Correlation type Correlation coefficient in HA samples Correlation coefficient in LA samples Edge color for visualization Edge line style for visualization HA positive positive and significant negative red solid line HA positive significance positive and significant positive red dashed line HA negative negative and significant positive light green solid line HA negative significance negative and significant positive or negative light green dashed line LA positive negative positive and significant green solid line LA positivesignificance positive positive and significant green dashed line LA negative positive negative and significant orange solid line LA negative significance positive or negative negative and significant orange dashed line [73]Open in a new tab Set of rules used for the classification of interactions (correlations) and assigning interaction types, edge color and line styles. Once the identification and classification of significant interactions were completed, we performed a hypergeometric test to identify the pathways over-represented for these significant interactions. The purpose of performing a hypergeometric test here was to test whether there the overlap between the gene interactions to pathway mappings from KEGG database and the interactions identified in the steps above was significant. The hypergeometric test we used is an in-built function (phyper) available in R statistical environment [74][32] and the probability values generated by the phyper function were converted into p-values (1-probability) and were corrected for multiple testing using Benjamini-Hochberg procedure (BH-correction). All the pathways with a p-adjusted value significance threshold of p-adj <0.05 from hypergeometric tests were considered as significantly enriched pathways. Results and Discussion Results from our analysis show that 1,023 interactions between 826 genes were significant in LA and HA data sets. Network analysis revealed that these 1,023 interactions formed into an interaction network and the largest connected component of this network contained 848 edges (interactions) and 563 nodes (genes) ([75]Figure 1). File S1 (Cytoscape .xgmml file) contains the significant interactions visualized as a network along with additional information such as LA and HA correlation coefficients, raw read counts for each gene, empirical p-value and correlation type for each interaction. Node degree (number of interactions of a gene) calculations done on this network revealed that genes such as LOC100623707 (POLR2G), ADCY9, PDE8B, NUDT2, PDE8B and LOC100620235 (PIK3R1) were some of the highly connected genes in this network. Among the significant interactions in this network, 209 interactions were LA positive, 201 interactions were LA negative, 257 interactions were HA positive and 220 interactions were HA negative ([76]Table 2). Among the genes involved in significant interactions, gene CYP17A1 is discussed as a candidate gene for androstenone biosynthesis in a number of publications [77][9], [78][10], [79][33], [80][34] due to its role in the conversion of 17 α-Hydroxy progesterone into androstenedione, a preliminary step in the synthesis of androstenone and testosterone [81][35]. Additionally, the gene LOC100620470 (HSD17B6) is previously reported to be in an androstenone related QTL region [82][14] and was also involved in significant interactions in this study. 17 β-hydroxysteroid dehydrogenase type 6 enzyme encoded by this gene catalyzes the conversion of testosterone back to androstenedione [83][36]. The gene SMPD1, involved in significant interactions in this study, was shown to be downregulated in high androstenone Duroc animals, however this result was not confirmed in rcPCR (real competitive PCR) validation [84][10]. It was shown that the enzyme encoded by SMPD1 cleaves sphingomyelin to ceramide, which in turn inhibits CYP19, a gene catalyzing a number of reactions in the synthesis of cholesterol, steroids and other lipids [85][37]. Figure 1. Network visualization of significant interactions identified. [86]Figure 1 [87]Open in a new tab Legend: nodes – genes, edges – interactions with significant z-scores. Edge legend: Red solid edges: interactions positive and significant in HA samples, negative in LA samples. Red dashed edges: interactions positive and significant in HA samples, positive in LA samples. Orange solid edges: interactions positive in HA samples, negative and significant in LA samples. Orange dashed edges: interactions negative in HA samples, negative and significant in LA samples. Dark green solid edges: interactions positive and significant in LA samples, negative in HA samples. Dark green dashed edges: interactions positive and significant in LA samples, positive in HA samples. Light green solid edges: interactions positive in LA samples, negative and significant in HA samples. Light green dashed edges: interactions negative in LA samples, negative and significant in HA samples. Table 2. Network statistics table. Significant interactions Significant interactionsin enriched pathways Edges (Interactions) 1,023 848 Nodes (genes) 826 563 LA positive interactions 209 173 LA positive significance interactions 35 31 LA negative interactions 201 166 LA negative significance interactions 30 24 HA positive interactions 257 217 HA positive significance interactions 42 39 HA negative interactions 220 189 LA negative significance interactions 29 26 [88]Open in a new tab This table contains basic information on networks generated from significant interactions and significant interactions in enriched pathways. The major aim behind pathway enrichment analysis was to relate significant interactions to metabolic pathways and to identify the key pathways and interactions that might be relevant for porcine testicular steroidogenesis and androstenone synthesis. Pathway enrichment analysis showed that out of 1,023 significant interactions, 865 interactions between 718 genes were enriched in 92 pathways ([89]Table 3). File S2 (Cytoscape .xgmml file) contains network visualization of significant interactions in enriched pathways and each edge in this network holds attributes containing KEGG pathway identifiers and names of enriched pathways. Among these enriched pathways, the top 5 enriched pathways in terms of the number of interactions were: purine, pyrimidine and glycerophospholipid metabolism pathways, phosphatidylinositol signaling system and Jak-STAT signaling pathway ([90]Table 3). The significant interactions in pathways such as synthesis and degradation of ketone bodies, steroid biosynthesis, oxidative phosphorylation, butanoate metabolism, drug metabolism – other enzymes and RNA transport were found only in HA samples and the interactions in antigen processing and presentation pathway, intestinal immune network for IgA production, autoimmune thyroid disease and allograft rejection pathways were found only in case of LA sample set ([91]Table 3). Although the pathways mentioned here were some of the top enriched pathways in our analysis, literature references [92][38]–[93][43] suggest that a number of these