Abstract

   One of the primary factors contributing to boar taint is the level of
   androstenone in porcine adipose tissues. A majority of the studies
   performed to identify candidate biomarkers for the synthesis of
   androstenone in testis tissues follow a reductionist approach,
   identifying and studying the effect of biomarkers individually.
   Although these studies provide detailed information on individual
   biomarkers, a global picture of changes in metabolic pathways that lead
   to the difference in androstenone synthesis is still missing. The aim
   of this work was to identify major pathways and interactions
   influencing steroid hormone synthesis and androstenone biosynthesis
   using an integrative approach to provide a bird’s eye view of the
   factors causing difference in steroidogenesis and androstenone
   biosynthesis. For this purpose, we followed an analysis procedure
   merging together gene expression data from boars with divergent levels
   of androstenone and pathway mapping and interaction network retrieved
   from KEGG database. The interaction networks were weighted with Pearson
   correlation coefficients calculated from gene expression data and
   significant interactions and enriched pathways were identified based on
   these networks. The results show that 1,023 interactions were
   significant for high and low androstenone animals and that a total of
   92 pathways were enriched for significant interactions. Although
   published articles show that a number of these enriched pathways were
   activated as a result of downstream signaling of steroid hormones, we
   speculate that the significant interactions in pathways such as
   glutathione metabolism, sphingolipid metabolism, fatty acid metabolism
   and significant interactions in cAMP-PKA/PKC signaling might be the key
   factors determining the difference in steroidogenesis and androstenone
   biosynthesis between boars with divergent androstenone levels in our
   study. The results and assumptions presented in this study are from an
   in-silico analysis done at the gene expression level and further
   laboratory experiments at genomic, proteomic or metabolomic level are
   necessary to validate these findings.

Introduction

   Androgens, the primary hormones secreted by testis control and regulate
   the development of male accessory reproductive organs and secondary
   sexual characteristics. In porcine genomics, the special importance
   given to studying the synthesis and degradation of androgens is mainly
   due to an androgenic pheromone called androstenone. The accumulation of
   androstenone in boar adipose tissues is one of the major factors
   contributing to boar taint [39][1]. Boar taint is described as an off
   odor or off taste of meat derived from non castrated male pigs. At
   present, in many countries, surgical castration of boars is the primary
   method to reduce boar taint in pork meat [40][2]. Since piglet
   castration without anesthesia is going to be banned in the European
   Community by 2018 due to animal welfare reasons [41][3], there is an
   immediate need to develop non surgical methods to reduce boar taint
   mainly by regulating the synthesis of androstenone. The two proposed
   non surgical methods to reduce boar taint are: (i) the use of chemicals
   or drugs to reduce boar taint [42][4] and (ii) breeding for favorable
   characteristics to reduce boar taint [43][5]. In this regard, it should
   be noted that the European Food Safety Authority (EFSA) has already
   expressed concerns over consumer perception of meats from animals
   treated with chemicals and drugs to reduce boar taint [44][6]. Consumer
   perception issues over the use of chemicals and drugs for boar taint
   reduction leaves breeding as a more sustainable and accepted method to
   adopt for reducing boar taint.

   In order to select favorable biomarkers for breeding pigs with low
   androstenone levels and hence reduce boar taint, it is crucial to
   understand the genetic machinery behind the synthesis of androstenone.
   Androstenone is synthesized in testes and metabolized in liver.
   Although studying genes, interactions and pathways in both testes and
   liver is essential to understand the entire androstenone metabolic
   processes, in this study the major focus points are the factors that
   could contribute to androstenone synthesis in testis tissues. The
   synthesis of androstenone from pregnenolone in testes is mainly
   catalyzed by the enzymes cytochrome P450C17 (CYP17A1) and cytochrome B5
   (CYB5) along with other reductases. The enzyme 5α reductases (ST5AR)
   catalyze the final step in the synthesis of androstenone [45][7]. At
   this point, it should be taken into account that although a number of
   enzymes catalyzing various steps in androstenone synthesis have been
   identified, the entire metabolic processes involved in the synthesis of
   androstenone has not been understood completely. Nevertheless, several
   studies have been performed to identify candidate biomarkers related to
   the synthesis of androstenone in testes. A study focusing on genetic
   correlations of backfat with direct and associative effects for
   androstenone has found that direct effects had a genetic correlation of
   0.14±0.08 and associative effects had a genetic correlation of
   −0.25±0.18 [46][8]. High throughput microarray studies have been
   conducted to study the difference in gene expression patterns between
   testis samples from boars with divergent androstenone levels [47][9],
   [48][10]. Additionally, several QTL (Quantitative trait loci) studies
   and GWAS (genome wide association studies) have also been done to
   identify candidate QTL regions and polymorphisms responsible for
   varying levels of androstenone [49][11]–[50][16]. An in-house study
   using data from RNA-seq technology has also been performed recently to
   identify candidate biomarkers for varying levels of androstenone in
   porcine testes and liver samples [51][17]. All these studies have been
   successful in identifying several candidate QTL regions, genes and
   polymorphisms as potential candidate biomarkers to pursue further
   detailed investigations.

   A general trend among these aforementioned studies is that the
   candidate biomarkers identified in these studies are mainly analyzed
   and explained individually using a reductionist approach. Although
   individual analysis of candidate biomarkers using a reductionist
   approach helps to study their functions in detail, a phenotype or a
   disease is seldom the consequence of a change in a single effector gene
   or gene product, but rather the result of a multitude of changes in a
   complex interaction network [52][18]. From this point of view,
   integrative approaches merging different data sources with gene
   expression profiles would be more suited to gain a better understanding
   of a complex trait such as androstenone. In human development and
   medicine, integrative analysis approaches merging gene expression
   profiles with pathway data or interaction network has been shown to be
   a powerful approach to understand the disease. The usual end result of
   such methods are diagnostic pathways or disease subnetworks, which are
   demonstrated to enhance the prediction accuracy of disease states and
   to be more reproducible than single genes [53][19]. In this work, we
   have followed an integrative analysis procedure by merging together
   interaction network and pathway information from KEGG pathway database
   along with gene expression data. A current limitation of this approach
   in terms of studying androstenone metabolism is that none of the major
   pathway databases contain data on metabolic reaction steps or gene
   interactions involved in androstenone biosynthesis. As a work around to
   this limitation, we have treated androstenone biosynthesis as an
   offshoot of steroid hormone (testosterone) synthesis pathway in testis
   under the assumption that the pathways and interaction events that
   affect steroid hormone biosynthesis could also affect androstenone
   biosynthesis. The major aim of this work was to identify and study the
   major metabolic pathways and interactions involved in the maintenance
   and regulation of steroidogenesis and androstenone biosynthesis using
   gene expression data from porcine testis samples with divergent levels
   of androstenone measurement through an integrative analysis approach.

Materials and Methods

Materials

Expression data

   The expression dataset used in this study is from a previous in-house
   RNA seq experiment conducted in order to understand the genetic
   mechanism behind androstenone metabolism [54][17]. For the purposes of
   this work, we used only the ten testis samples from the original study.
   In the original study, these ten testis samples were selected from a
   pool of 100 boars. In this pool of animals, boars with a fat
   androsteone level of 0.5 µg/g or less were defined as low androstenone
   (LA) animals and boars with a fat androstenone concentration of 1.00
   µg/g or more were defined as high androstenone (HA) animals. From this
   population, 5 animals with extreme high and low levels of androstenone
   were selected as sample LA and HA population. The average androstenone
   measurement of LA sample animals was 0.24±0.06 µg/g and the average
   androstenone measurement of HA sample animals was 2.48±0.56 µg/g. Among
   these 10 animals, two sets of 3 animals each: 1 LA and 2 HA animals in
   the first set and 2 LA and 1 HA animals in the second set were half
   siblings. Additional details of sample collection, library preparation
   and sequencing are available in [55][17].

Pathway and network data

   We retrieved pathway and interaction network data from KEGG database
   (Release 60.0). This interaction network was comprised of enzyme -
   enzyme (reaction steps) and protein - protein interactions mapped to
   the corresponding porcine gene identifiers and annotated with KEGG
   pathway names and identifiers in which the interactions occur. The
   interaction network consisted of 23,198 edges (interactions) between
   3,510 nodes (genes) mapped to 197 pathways.

Methods

Expression data quality control, mapping and normalization

   The first step in expression data analysis was a quality control and
   filtering step. In this step, PCR primers and bad quality sequences
   (Phred score <20) identified in the raw reads using FASTQC quality
   control application [56][20] were trimmed off. The selection of
   threshold cut-off (Phred score >20) was arbitrary and yet this cut-off
   threshold ensured that only the reads with a base quality score of 99%
   or more were retained for further analysis. The filtered raw reads were
   mapped to latest Sus scrofa genome build, Sscrofa10.2 from NCBI using a
   “splice aware” mapping algorithm TopHat [57][21] to generate individual
   genome mapping files for each sample. The expression set (expression
   matrix) was created by calculating read counts (expression values) for
   each gene from these genome mapping files using BEDTools [58][22]. It
   has been shown that the read count expression data set generated from
   an RNA-seq experiment follows a negative binomial distribution
   [59][23], but the classical linear modeling analysis procedures
   developed for microarray data sets assumes the data to be normally
   distributed. Although various non parametric procedures (distribution
   free methods) can be used in this context, we found that the results
   given by such analysis procedures were statistically non significant,
   owing to the small sample size of our data set and the limited power of
   non parametric methods to draw significant conclusions from data sets
   with small sample sizes. Recently, Law et al. [60][24] proposed
   applying normal distribution based microarray like statistical analysis
   methods to RNA-seq read count data. In order to overcome the
   limitations of small sample sizes and non parametric methods to an
   extend and also following the proposed idea in [61][24] of using normal
   distribution based microarray like statistical analysis methods to
   RNA-seq read count data, we normalized and log2 transformed our
   expression data set using “voom” function implemented in limma R
   package [62][25]. Comparison of various normalization and differential
   expression analysis methods for RNA-seq data have shown that voom
   normalization combined with limma package is relatively unaffected by
   outliers and performed well under many conditions [63][26]. An
   additional study [64][27] concluded that modeling RNA-seq gene count
   data as log normal distribution with appropriate pseudo counts (limma
   voom modeling) is a reasonable approximation of the data. Mean-variance
   modeling at the observational level (voom) estimates mean-variance
   relationship in the read count data and computes weights for each
   observation based on this relationship [65][24]. Our expression dataset
   was generated and normalized based on the above mentioned procedure.

Identification of significant interactions

   Since we intended to identify significant interactions based on
   information from expression data and pathway interaction network, the
   very first step after quality control and normalization of expression
   data set was to trim the expression data set for genes in pathway
   interaction network. There were a total of 2,871 genes in common
   between both the transformed expression dataset and the interaction
   network from KEGG database. The trimmed interaction network had 23,198
   edges (interactions) between 2,871 nodes (genes). In the next step, we
   calculated Pearson Correlation Co-efficient (PCC) of gene expression
   values for both LA and HA expression sets separately and by using these
   correlation coefficients as edge weight values, we generated two edge
   weighted interaction networks, (i) LA network where LA correlation
   coefficients were used as edge weight values and (ii) HA network where
   HA correlation coefficients were used as edge weight values. At this
   step, only those correlations with an edge in the interaction network
   were considered and the remaining correlation coefficients,
   interactions and genes were excluded from further analysis. Both LA and
   HA correlation coefficient weighted interaction networks contained
   2,871 nodes (genes) and 15,960 edges (interactions) respectively. We
   termed the interaction network weighted with correlation coefficients
   from LA samples as “LA network” and the one weighted with correlation
   coefficients from HA samples as “HA network”. In order to identify the
   interactions that are significantly different between both LA and HA
   networks, the edge weights (correlation coefficients) of both the
   networks were transformed to z-score using Fisher-r-to-z transformation
   based on the equation:
   graphic file with name pone.0091077.e001.jpg (1)

   where r is the PCC.

   Following the calculation of z-scores for interactions in both
   networks, the differences between the z-scores were also calculated.
   For an edge z-score in LA interaction network, the corresponding edge
   z-score from HA interaction network was retrieved and the difference
   between the z-scores was calculated as:
   graphic file with name pone.0091077.e002.jpg (2)

   In order to identify significant z-score[DIFF] (and there by
   significant interactions),we followed a two step evaluation criteria
   based on random sampling and permutation approach [66][28]. Permutation
   and resampling based methods for estimating significance thresholds
   have already been used in high throughput studies [67][29], [68][30].
   The evaluation criteria used in this study were: (i) zscore[DIFF]
   should be significant at a threshold of empirical p-value <0.05 against
   a set of z-scores randomly generated from the original expression data
   and (ii) at least one of the correlation coefficients used to calculate
   zscore[DIFF] (in [69]equation 2) should be significant at a threshold
   of empirical p-value <0.05 against a set of correlation coefficients
   randomly generated from the original expression data. For generating
   the set of z-scores in evaluation criteria (i) the first step was to
   generate a random expression matrix by randomly shuffling and
   reassigning the expression values into two sample groups. By this
   random shuffling and reassigning, we aimed to break up the original
   ordering and classification of the expression values and generate two
   complete random expression matrices and artificially replicate a set of
   z-scores calculated from a random population. Pearson correlation
   coefficients, random z-scores and z-score differences were calculated
   on these random expression matrices following the steps described
   previously. This entire process was repeated 10,000 times to generate a
   set of random z-score differences (zscore[RAND]) for each interaction.
   Significance threshold empirical p-value for each zscore[DIFF] was
   calculated as:
   graphic file with name pone.0091077.e003.jpg (3)

   where N = 10,000.

   A similar procedure was followed for calculating significance threshold
   empirical p-value for correlations in evaluation criteria (ii), where
   empirical p-values were calculated between correlation coefficients
   from randomly sampled expression data (random population correlation
   coefficients) and the original correlation coefficients from LA or HA
   datasets.

   Selecting only significant zscore[DIFF]s for further analysis would
   imply that we were selecting only those gene – gene interactions with a
   significant difference between LA and HA z-scores when compared to the
   set of random population z-scores. By adding the additional criterion
   that at least one of the correlation coefficients used to calculate the
   zscore[DIFF] should be significantly different from the set of random
   population correlation coefficients, we ensured that the selected gene
   – gene interactions had not only significant zscore[DIFF]s but also at
   least one significant correlation coefficient when compared to the
   random population data. We termed these selected interactions as
   “significant interactions”, since zscore[DIFF] defined for these
   interactions (edges) and at least one of the correlation coefficients
   used to calculate zscore[DIFF] were significant with respect to random
   population data. Once the identification of significant interaction was
   complete, we further classified these significant interactions into 8
   interaction types such as: HA positive, HA positive significance, HA
   negative, HA negative significance, LA positive, LA positive
   significance, LA negative and LA negative significance. The rules used
   for classification of these interaction types and edge colors and line
   styles used in visualization of these interaction types are given in
   [70]Table 1. These classification rules were mainly used in the
   visualization step, and all the interaction networks in this work were
   visualized using Cytoscape [71][31]. [72]Figure S1 shows a schematic
   diagram of the entire workflow used in this analysis.

Table 1. Interaction edge classification rules.

   Correlation type Correlation coefficient in HA samples Correlation
   coefficient in LA samples Edge color for visualization Edge line style
   for visualization
   HA positive positive and significant negative red solid line
   HA positive significance positive and significant positive red dashed
   line
   HA negative negative and significant positive light green solid line
   HA negative significance negative and significant positive or negative
   light green dashed line
   LA positive negative positive and significant green solid line
   LA positivesignificance positive positive and significant green dashed
   line
   LA negative positive negative and significant orange solid line
   LA negative significance positive or negative negative and significant
   orange dashed line
   [73]Open in a new tab

   Set of rules used for the classification of interactions (correlations)
   and assigning interaction types, edge color and line styles.

   Once the identification and classification of significant interactions
   were completed, we performed a hypergeometric test to identify the
   pathways over-represented for these significant interactions. The
   purpose of performing a hypergeometric test here was to test whether
   there the overlap between the gene interactions to pathway mappings
   from KEGG database and the interactions identified in the steps above
   was significant. The hypergeometric test we used is an in-built
   function (phyper) available in R statistical environment [74][32] and
   the probability values generated by the phyper function were converted
   into p-values (1-probability) and were corrected for multiple testing
   using Benjamini-Hochberg procedure (BH-correction). All the pathways
   with a p-adjusted value significance threshold of p-adj <0.05 from
   hypergeometric tests were considered as significantly enriched
   pathways.

Results and Discussion

   Results from our analysis show that 1,023 interactions between 826
   genes were significant in LA and HA data sets. Network analysis
   revealed that these 1,023 interactions formed into an interaction
   network and the largest connected component of this network contained
   848 edges (interactions) and 563 nodes (genes) ([75]Figure 1). File S1
   (Cytoscape .xgmml file) contains the significant interactions
   visualized as a network along with additional information such as LA
   and HA correlation coefficients, raw read counts for each gene,
   empirical p-value and correlation type for each interaction. Node
   degree (number of interactions of a gene) calculations done on this
   network revealed that genes such as LOC100623707 (POLR2G), ADCY9,
   PDE8B, NUDT2, PDE8B and LOC100620235 (PIK3R1) were some of the highly
   connected genes in this network. Among the significant interactions in
   this network, 209 interactions were LA positive, 201 interactions were
   LA negative, 257 interactions were HA positive and 220 interactions
   were HA negative ([76]Table 2). Among the genes involved in significant
   interactions, gene CYP17A1 is discussed as a candidate gene for
   androstenone biosynthesis in a number of publications [77][9],
   [78][10], [79][33], [80][34] due to its role in the conversion of 17
   α-Hydroxy progesterone into androstenedione, a preliminary step in the
   synthesis of androstenone and testosterone [81][35]. Additionally, the
   gene LOC100620470 (HSD17B6) is previously reported to be in an
   androstenone related QTL region [82][14] and was also involved in
   significant interactions in this study. 17 β-hydroxysteroid
   dehydrogenase type 6 enzyme encoded by this gene catalyzes the
   conversion of testosterone back to androstenedione [83][36]. The gene
   SMPD1, involved in significant interactions in this study, was shown to
   be downregulated in high androstenone Duroc animals, however this
   result was not confirmed in rcPCR (real competitive PCR) validation
   [84][10]. It was shown that the enzyme encoded by SMPD1 cleaves
   sphingomyelin to ceramide, which in turn inhibits CYP19, a gene
   catalyzing a number of reactions in the synthesis of cholesterol,
   steroids and other lipids [85][37].

Figure 1. Network visualization of significant interactions identified.

   [86]Figure 1
   [87]Open in a new tab

   Legend: nodes – genes, edges – interactions with significant z-scores.
   Edge legend: Red solid edges: interactions positive and significant in
   HA samples, negative in LA samples. Red dashed edges: interactions
   positive and significant in HA samples, positive in LA samples. Orange
   solid edges: interactions positive in HA samples, negative and
   significant in LA samples. Orange dashed edges: interactions negative
   in HA samples, negative and significant in LA samples. Dark green solid
   edges: interactions positive and significant in LA samples, negative in
   HA samples. Dark green dashed edges: interactions positive and
   significant in LA samples, positive in HA samples. Light green solid
   edges: interactions positive in LA samples, negative and significant in
   HA samples. Light green dashed edges: interactions negative in LA
   samples, negative and significant in HA samples.

Table 2. Network statistics table.

   Significant interactions Significant interactionsin enriched pathways
   Edges (Interactions) 1,023 848
   Nodes (genes) 826 563
   LA positive interactions 209 173
   LA positive significance interactions 35 31
   LA negative interactions 201 166
   LA negative significance interactions 30 24
   HA positive interactions 257 217
   HA positive significance interactions 42 39
   HA negative interactions 220 189
   LA negative significance interactions 29 26
   [88]Open in a new tab

   This table contains basic information on networks generated from
   significant interactions and significant interactions in enriched
   pathways.

   The major aim behind pathway enrichment analysis was to relate
   significant interactions to metabolic pathways and to identify the key
   pathways and interactions that might be relevant for porcine testicular
   steroidogenesis and androstenone synthesis. Pathway enrichment analysis
   showed that out of 1,023 significant interactions, 865 interactions
   between 718 genes were enriched in 92 pathways ([89]Table 3). File S2
   (Cytoscape .xgmml file) contains network visualization of significant
   interactions in enriched pathways and each edge in this network holds
   attributes containing KEGG pathway identifiers and names of enriched
   pathways. Among these enriched pathways, the top 5 enriched pathways in
   terms of the number of interactions were: purine, pyrimidine and
   glycerophospholipid metabolism pathways, phosphatidylinositol signaling
   system and Jak-STAT signaling pathway ([90]Table 3). The significant
   interactions in pathways such as synthesis and degradation of ketone
   bodies, steroid biosynthesis, oxidative phosphorylation, butanoate
   metabolism, drug metabolism – other enzymes and RNA transport were
   found only in HA samples and the interactions in antigen processing and
   presentation pathway, intestinal immune network for IgA production,
   autoimmune thyroid disease and allograft rejection pathways were found
   only in case of LA sample set ([91]Table 3). Although the pathways
   mentioned here were some of the top enriched pathways in our analysis,
   literature references [92][38]–[93][43] suggest that a number of these