Abstract

   Previous studies of horse RNA-seq were performed by mapping sequence
   reads to the reference genome during transcriptome analysis. However in
   this study, we focused on two main ideas. First, differentially
   expressed genes (DEGs) were identified by de novo–based analysis (DBA)
   in RNA-seq data from six Thoroughbreds before and after exercise,
   here-after referred to as “de novo unique differentially expressed
   genes” (DUDEG). Second, by integrating both conventional DEGs and genes
   identified as being selected for during domestication of Thoroughbred
   and Jeju pony from whole genome re-sequencing (WGS) data, we give a new
   concept to the definition of DEG. We identified 1,034 and 567 DUDEGs in
   skeletal muscle and blood, respectively. DUDEGs in skeletal muscle were
   significantly related to exercise-induced stress biological process
   gene ontology (BP-GO) terms: ‘immune system process’; ‘response to
   stimulus’; and, ‘death’ and a KEGG pathways: ‘JAK-STAT signaling
   pathway’; ‘MAPK signaling pathway’; ‘regulation of actin cytoskeleton’;
   and, ‘p53 signaling pathway’. In addition, we found TIMELESS, EIF4A3
   and ZNF592 in blood and CHMP4C and FOXO3 in skeletal muscle, to be in
   common between DUDEGs and selected genes identified by evolutionary
   statistics such as F[ST] and Cross Population Extended Haplotype
   Homozygosity (XP-EHH). Moreover, in Thoroughbreds, three out of five
   genes (CHMP4C, EIF4A3 and FOXO3) related to exercise response showed
   relatively low nucleotide diversity compared to the Jeju pony. DUDEGs
   are not only conceptually new DEGs that cannot be attained from
   reference-based analysis (RBA) but also supports previous RBA results
   related to exercise in Thoroughbred. In summary, three exercise related
   genes which were selected for during domestication in the evolutionary
   history of Thoroughbred were identified as conceptually new DEGs in
   this study.

Introduction

   Since domestication, at around 3500 B.C.E, horses have mainly been used
   for riding and racing [55][1]. One domesticated breed of horses, the
   Thoroughbred, has been specifically bred for speed, endurance, and
   strength since the 18^th century. The extreme selection for these
   traits has resulted in a highly adapted athlete [56][2] with very high
   aerobic capacity [57][3], and high skeletal muscle mass [58][4], which
   comprises over 55% of total body mass [59][5]. The Thoroughbred is an
   excellent breed for competitive horse racing and by extension a
   valuable model for studying exercise response. A previous study has
   shown that exercise training in Thoroughbreds resulted in coordinated
   changes in the expression of genes related to metabolism, oxidative
   phosphorylation and muscle structure [60][6].

   Domestication leads to gradual changes at the genetic level by a
   process of selection in a population of animals or plants. Most
   domestic animals were selectively bred for the goal of benefitting
   human beings. Due to the combined effect of natural selection and
   human-controlled selective breeding, phenotypic changes, which are
   related to genetic mutation, accompany the domestication process. Some
   genetic mutations with beneficial phenotypic effects have been either
   highly enriched or vanished by selective sweeps [61][7]. A selective
   sweep is the reduction of genetic diversity in the neighboring DNA of a
   fixed mutation. Selective sweep regions in the genome can potentially
   be identified by a genome scan, and the low variation interval
   surrounding the selected gene can be found by fine-scale mapping. Using
   such genome scans, selective sweeps have been identified in domestic
   and natural (wild progenitor) populations [62][8], [63][9].

   Previous horse transcriptome studies using RNA-seq were carried out by
   mapping sequence reads to a reference genome. However, reference genome
   assembly has been known to have flaws including missing expressed genes
   [64][10], hundreds to thousands of miss-assemblies and large genomic
   deletions [65][11], and problems in trans-spliced genes [66][12].
   Therefore, the results and success of reference transcriptome assembly
   depends on both the availability and quality of the reference genome.
   On the contrary, de novo transcript assembly has several advantages.
   First, it does not depend on a reference genome [67][13]. This is a key
   advantage as many organisms do not have a high-quality finished
   reference genome. For these organisms, de novo assembly becomes the
   first analysis step. Also, it does not depend on the correct alignment
   of reads to known splice sites [68][14] or the prediction of novel
   splicing sites, both of which are required by reference-based
   assemblers. Trans-spliced transcripts and similar transcripts
   originating from chromosomal rearrangements can be assembled using the
   de novo approach. In addition, de novo transcriptome assembly can help
   researchers investigate genes that are absent in the reference genome
   due to the incompleteness of reference sequences [69][10]. Lastly, it
   can identify new transcripts and new transcript structures [70][15],
   [71][16]. However, reconstruction of full-length transcripts from short
   reads with considerable sequencing error rates poses substantial
   computational challenges [72][17]. Still, de novo assembly in RNA-seq
   is an important approach for carrying out transcriptomic studies.

   Recently, many de novo assembly software tools have been developed,
   most of which take the de Bruijn graph approach. This approach usually
   has two important parameters: k-mer length and coverage cutoff value
   [73][16]. Tools such as Trans-ABySS [74][15], Trinity [75][17], ABySS
   [76][18], Oases [77][19], Rnnotator [78][20], Multiple-k [79][21],
   SOAPdenovo [80][22] and Velvet [81][23] follow this approach.
   Considering these de novo assembly software tools, Manfred G Grabherr,
   et.al concluded that Trinity de novo assembly software tool is superior
   to others for a number of reasons. Specifically, Trinity fully
   reconstructs a large fraction of transcripts, including alternatively
   spliced isoforms and transcripts from recently duplicated genes. In
   addition, Trinity resolves ∼99% of the initial sequencing errors,
   determines splice isoforms, and distinguishes transcripts from recently
   duplicated and identified allelic variants [82][17].

   In this study, we used de novo-based analysis (DBA) to identify
   differentially expressed genes (DEGs) that could not be identified with
   reference-based analysis (RBA). Furthermore, we integrate DEGs with
   genes identified as being selected for during domestication to reveal
   genes that are related to the evolution of exercise response during the
   domestication process of the Thoroughbred.

Materials and Methods

Ethics statement

   This study was carried out in strict accordance with the
   recommendations in the Guide for the Care and Use of Laboratory Animals
   of Pusan National University. All experimental procedures used in this
   study were approved by the Institutional Animal Care and Use Committee
   of the Pusan National University (PNU-2013-0417). The owners of the
   Thoroughbred horses gave permission for their animals to be used in
   this study.

Analysis of horses RNA-seq data

1. RNA-seq data between before and after exercise

   We generated RNA-seq data from six Thoroughbred horses before and after
   exercise as described in a previous study [83][24]. Samples of skeletal
   muscle and blood were taken from six Thoroughbreds before and after
   exercise. ‘Before exercise’ samples were collected from the triceps
   brachii of the right leg and from the jugular vein and carotid artery
   of each horse. After an adequate resting period of several hours, the
   horses were subjected to a 30-min trot. Then, immediately after this
   trot, the ‘after exercise’ samples were collected from the same tissues
   of each individual. Thoroughbreds usually canter for 17–18 min per day.
   For the purposes of this study, a 30-min trot was taken to be the
   equivalent to 17–18 min of cantering. Total RNA from the skeletal
   muscle and blood samples were isolated using TRIzol (Invitrogen) and
   the RNeasy RNA purification kit with DNase treatment (Qiagen). mRNA was
   isolated from the total RNA using oligo-dT beads, then reverse
   transcribed into double-stranded cDNA fragments. Construction and
   sequencing of an RNA sequence library for each sample was carried out
   based on Illumina HiSeq2000 protocols in order to generate 90 bp
   paired-end reads. Twenty-four sets of transcriptome data were generated
   from muscle and blood samples of six horses obtained before and after
   exercise.

   The RNA-sequencing data from this study have been submitted to the NCBI
   Gene Expression Omnibus (GEO) ([84]http://www.ncbi.nlm.nih.gov/geo/)
   under the accession number [85]GSE37870.

2. De novo-based analysis (DBA)

   We used the Trinity de novo assembly software tool [86][17], [87][25]
   following the default settings, except for the following options:
   number of CPU and alignment method: bowtie. First, Trinity tools
   generated a reference for each obtained sample (a total of 24 samples)
   to detect DEGs. Second, Trinity tools generated a reference for each
   individual (a total of 4 samples) to compare with the SNPs from the
   whole genome sequence, RNA-seq (using reference transcriptome assembly)
   and RNA-seq (using de novo assembly).

   The component ID was converted to known transcript ID
   ([88]ftp://ftp.ensembl.org/pub/release-73/fasta/equus_caballus/cdna/)
   using Blastall [89][26], a user-friendly, free open source tool, which
   is suited for short read alignment. After conversion, we filtered the
   transcript ID by alignment length of higher than 80%, more detail about
   the de novo assembly method is given in the Supplementary Methods
   ([90]File S1).

   Reference-based analysis (RBA): Most of the RBA used in this study is
   described in Kim et al [91][27]. TopHat [92][28] (ver.1.4.1) was used
   to map the sequences to a horse reference genome and annotated using
   the EquCab2 database
   ([93]http://hgdownload.cse.ucsc.edu/downloads.html#horse).

3. DEG selection (de novo vs reference)

   We examined the differential expression of replicated count data by
   applying a method based on negative binomial model as implemented in
   the R package EdgeR [94][29]. This package was used because RNA
   sequence data may exhibit more variability than expected in a Poisson
   distribution due to wide dispersal in the genome. The method
   implemented in the EdgeR package automatically takes all known sources
   of variation into account. Significant DEGs were detected with a
   cut-off value of FDR<0.01, based on a paired design between‘before
   exercise’ and ‘after exercise’.

4. Genotype calling and SNP calling (de novo vs reference)

   Three open source packages were used for downstream processing and
   variant calling: Picard Tools ([95]http://picard.sourceforge.net),
   SAMtools [96][30] and the genome Analysis Toolkit (GATK [97][31]).
   Substitution calls were made with the GATK Unified Genotyper [98][32].
   All calls with a Phred-scaled quality of less than 30 were filtered out
   and VCFtools [99][33] was used for handling the vcf file format.

5. DAVID analysis (de novo vs reference)

   One simple but extremely widely used systems biology technique for
   highlighting biological processes is gene category over-representation
   analysis. In order to perform this analysis, genes are grouped into
   categories by a common biological property and then tested to find
   categories that are over represented among the differentially expressed
   genes. Gene Ontology (GO) categories are commonly used in this
   technique and there are many tools available for performing GO and KEGG
   pathway analysis. We used DAVID [100][34] web tool to convert the
   equine Ensembl gene IDs to official gene symbols. This was carried out
   by cross-matching the equine Ensembl gene IDs to the human Ensembl gene
   IDs and the official gene symbols. The representation of functional
   groups in blood and skeletal muscle relative to the whole genome was
   investigated using the Expression Analysis Systematic Explorer (EASE)
   tool [101][35] within DAVID. The EASE tool is a modified Fisher's exact
   test used to measure enrichment of gene ontology (GO) terms [102][36].
   To identify enriched GO terms, functionally clustered genes were
   filtered by an EASE value of <0.01. In addition, A KEGG pathway
   enrichment test was performed using EASE, with a cut-off value <0.1

6. Quantitative real-time reverse transcript-PCR (qRT-PCR) validation

   A blood sample was obtained from a Thoroughbred horse maintained at
   Ham-an Racing Horse Resort & Training Center before and after exercise.
   Exercise was performed as a 30-min trotting on a treadmill. Trizol
   reagent (Invitrogen) was used to extract total RNA from leukocytes
   after exercise, according to the Invitrogen manual. In order to prevent
   contamination of genomic DNA, RNase-free DNase kit (Qiagen) was used
   according to the manufacturer's operating manual. Total RNA
   quantification was performed by using NanoDrop® ND-1000
   Spectrophotometer. cDNAs were synthesized in a reaction with oligo-dT
   primers, moloney-murine leukemia virus (MMLV) reverse transcriptase
   (Promega), RNase inhibitor (Promega) and RNase-free ddH[2]O, which was
   incubated at 37°C for 4 h.

   To confirm the de novo unique differentially expressed genes revealed
   by RNA-Seq, seven DUDEGs were analyzed by RT-PCR amplification. The
   primers were designed using the PRIMER3 software
   ([103]http://frodo.wi.mit.edu/primer3/) ([104]Table S9 in File S1).

   The RT-PCR conditions were as follows: an initial step of 94°C for 10
   min, 35 cycles of 94°C for 30 sec, 60°C for 30 sec, 72°C for 30 sec,
   and final step of 72°C for 10 min. PCR bands were normalized with
   glyceraldehyde-3-phosphate dehydrogenase (GAPDH) band. RT-PCR products
   were visualized by gel electrophoresis on a 2.0% SeaKem LE agarose gel.

   cDNA was analyzed by BioRad CFX-96. All samples were measured in
   triplicate to ensure reproducibility, and C[t] value was calculated
   using 2^−ΔΔCt method [105][37].

Analysis of horse whole genome re-sequencing data

1. Whole genome re-sequencing data of Thoroughbred and Jeju domestic ponies

   Whole-blood samples were collected from 18 Thoroughbred racing
   stallions of the Korean Racing Authority, and from four male and two
   female Jeju domestic ponies (Equus caballus) of the Jeju Provincial
   Livestock Institute, Korea. A 10 mL sample of blood was drawn from the
   carotid artery of each horse and was treated with heparin to prevent
   clotting. Genomic DNA was extracted and a quality check was carried out
   using fluorescence-based quantification on an agarose gel, a standard
   electrophoresis on a 0.6% agarose gel and, via a pulsed-field gel,
   using 200 ng of DNA. Manufacturers' instructions were followed to
   create a paired library of 500-bp fragments. This consisted of the
   following: purified genomic DNA fragments of less than 800 bp,
   fragments with blunt ends, fragments with 5′ phosphorylated ends,
   fragments with a 3′- dA overhang, some with adaptor-modified ends,
   purified ligation product, and a genomic DNA library. Following this,
   we generated sequence data using HiSeq 2000 (Illumina, Inc).

2. Reference genome assembly

   Using the Burrows-Wheeler Aligner [106][38] with the default setting,
   pair-end sequence reads were mapped to the reference horse genome
   ([107]ftp://ftp.ensembl.org/pub/release-73/fasta/equus_caballus/dna/)
   ([108]Table S8 in File S1). The DNA re-sequencing data from this study
   have been submitted to the NCBI Sequence Read Archive (SRA) database
   under the accession numbers SRA053569, SRA054885 and SRP017702.

3. Genotype calling and SNP calling

   We used the following open-source software packages; Picard Tools,
   SAMtools, and the Genome analysis toolkit, for downstream processing
   and variant calling. Substitution calls were made with GATK
   UnifiedGenotyper20 and all calls with a Phred-scaled quality of less
   than 30 were filtered out. For each chromosome, we simultaneously
   inferred the phased haplotype and inputed the missing alleles for the
   entire set of Thoroughbred populations using BEAGLE [109][39].

4. Estimation of Nucleotide diversity, F[ST] and Cross Population Extended
Haplotype Homozygosity (XP-EHH) value

   Nucleotide diversity and long run of homozygosity (LROH) of
   Thoroughbred and Jeju domestic ponies for each chromosome were
   calculated by VCFtools. Conventional F[ST] [110][40] and Reynolds F[ST]
   [111][41] values were calculated for genes using Arlequin 3.5 [112][42]
   based on pairwise differences between the haplotypes of Thoroughbred
   and Jeju domestic ponies. In order to calculate F[ST], we used the
   horse genome to phase the haplotypes of the two populations. Also, to
   calculate F[ST] by each gene region, we used the genomic information
   (Ensembl Genes71, EquCab2), namely the Ensembl reference annotated gene
   information. We selected the genes of the top 1% of the empirical
   distribution (empirical p-value<0.01) [113][43]. The method Cross
   Population Extended Haplotype Homozygosity (XP-EHH) was used to detect
   selective sweeps using the software xpehh [114][44]
   ([115]http://hgdp.uchicago.edu/Software/). For XP-EHH analysis, we used
   haplotype information for all SNPs of the entire autosome, and we
   calculated Extended Haplotype Homozygosity (EHH) and the log-ratio
   integrated EHH (iHH) for the pairwise test of the Thoroughbred and Jeju
   domestic pony populations. The log ratios were standardized to have a
   mean of 0 and a variance of 1, and p-values were assigned assuming a
   normal distribution. We selected SNPs with p-values<0.01, which are
   considered to have strong selection signals. Then we apply a cutoff
   value of XP-EHH values<0 for finding adaptation in the Thoroughbred. We
   chose genes related with these SNPs by identifying genes located within
   a 10 kb [116][45] boundary of these SNPs. Since XP-EHH is not sensitive
   to allele frequencies, there is no need to stratify the data into
   frequency bins before determining significance. The p-values are
   empirical p-values; that is, a low p-value indicates that a locus is an
   outlier with respect to the rest of the genome. However, we note that
   loci detected as being under selection using this approach may be an
   under-representative sample of all truly selected loci; in particular,
   selection on standing variation and recessive loci are likely to be
   underrepresented [117][43].

Results

Differences in the results of reference-based and de novo-based assembly and
analysis

   Transcriptome analysis results of reference-based analysis (RBA) and de
   novo-based analysis (DBA) showed a substantial difference in the number
   of transcript and differentially expressed genes (DEGs) identified. In
   RBA, for blood and skeletal muscle, 15,900 and 17,927 transcripts were
   found, respectively, among which 2,244 and 1,405 were unique
   transcripts. In DBA, the numbers of transcripts in skeletal muscle and
   blood were 18,057 and 19,413, respectively with 4,401 and 2,892 unique
   transcripts. The numbers transcripts in common between RBA and DBA were
   13,656 for skeletal muscle tissue and 16,521 for blood ([118]Figure 1a
   and [119]Figure S1a in File S1). When the sample variance of RBA and
   DBA in skeletal muscle and blood were compared using multidimensional
   scaling (MDS) plot, the results for the two analyses were almost
   identical. The skeletal muscle samples were clustered into two
   subgroups: before and after exercise, but the blood samples did not
   show any clustering ([120]Figure 1b, [121]Figure S1b and Figure S2 in
   File S1). In RBA, the number of DEG in skeletal muscle and blood were
   2,818 and 455, respectively with 2,200 and 427 DEGs being unique to
   RBA. In DBA, the number of DEG in skeletal muscle and blood were 1,652
   and 595, respectively with 1,034 and 567 unique DEGs. The number of
   DEGs identified by both RBA and DBA were 618 and 28 in skeletal muscle
   and blood, respectively ([122]Figure 1c and [123]Figure S1c in File
   S1). These DEGs were compared using Heatmap visualization to examine
   their expression pattern in each analysis. The expression pattern was
   similar, however, the intensity of the expression was higher with DBA
   ([124]Figure 1d and [125]Figure S1d in File S1). Overall, in comparison
   to RBA, DBA identified a higher number of transcripts but a lower
   number of DEGs.

Figure 1. Summary of comparative analysis between de novo assembly and
reference genome assembly from RNA-seq data of skeletal muscle from six
Thoroughbreds before and after exercise (Total 12 samples).

   [126]Figure 1
   [127]Open in a new tab

   a) The number of transcripts in common between de novo assembly and
   reference genome assembly b) MDS plot of six Thoroughbreds before and
   after exercise using de novo assembly. c) The number of DEGs between de
   novo assembly and reference genome assembly. d) Heat-map visualization
   of common DEGs between de novo assembly and reference genome assembly:
   rows represent DEGs from skeletal muscle and columns represent assemble
   method from 6 horse samples (*First ‘B’ is for blood and ‘M’ is for
   muscle. ‘F1’, ‘F2’, ‘F3’ and ‘S3’ are horse samples. Last ‘B’ is for
   ‘before exercise’ and ‘P’ is for ‘after exercise’).

   We detected SNPs from two different next-generation sequencing methods
   (WGS and RNA-seq) and two different assembly methods (RBA and DBA) for
   each Thoroughbred sample (F1, F2 and F3 = male, S3 = female). The
   number and rate of SNPs in DBA of RNA-seq was 108,158 (0.031%), 110,502
   (0.031%), 105,920 (0.03%) and 101,887 (0.029%) in F1, F2, F3 and S3
   respectively, and the number and rate of SNPs in from RBA of RNA-seq
   were 284,859 (0.012%), 287,286 (0.012%), 276,241 (0.011%) and 265,729
   (0.011%) in F1, F2, F3 and S3, respectively ([128]Table S2 in File S1).

Identification of de novo unique differentially expressed genes (DUDEGs)
before and after exercise

   We identified DUDEGs from RNA-seq data using the expression profiles of
   genes in skeletal muscle and blood samples taken from six Thoroughbreds
   before and after exercise. There were a total of 1,034 significant
   DUDEGs (519 up-regulated, 515 down-regulated) in skeletal muscle and
   567 (314 up-regulated, 253 down-regulated) in blood (FDR<0.01). Among
   them, 456 (61 up-regulated, 395 down-regulated) in skeletal muscle and
   205 (93 up-regulated, 112 down-regulated) in blood were annotated
   ([129]Table S1 in File S1).

Validation of DUDEGs in horse RNA-seq data using RT-PCR

   We performed RT-PCR to validate the DUDEGs detected in horse blood. The
   seven genes (TIMELESS, EIF4A3, PGIW, ANK3, MSH3, SYNRG, ASGR2: 2
   up-regulated and 5 down-regulated) were randomly selected with
   conceptually new DEGs and logFC>2 in blood ([130]Table S1, S4 and S5 in
   File S1). The expression levels of DUDEGs between RNAseq and RT-PCR
   were highly similar ([131]Figure S8 in File S1). The results confirmed
   that DUDEGs identified in this study were reliable.

Functional annotation of DUDEGs

   We summarized the highest biological process gene ontology (BP-GO) of
   DUDEGs in skeletal muscle sample taken before and after exercise from
   six Thoroughbred RNA-seq data ([132]Figure 2). The other BP-GO of
   DUDEGs were summarized separately ([133]Figure S3, Figure S4, Figure S5
   and Figure S6 in File S1). The most significantly enriched terms among
   up-regulated genes in skeletal muscle were ‘biological adhesion’,
   ‘biological regulation’, ‘death’, ‘growth’, ‘immune system process’,
   ‘locomotion’, ‘multi-organism process’, and ‘response to stimulus’
   (P-value = 6.29E-02, P-value = 9.57E-04, P-value = 1.75E-06,
   P-value = 7.56E-03, P-value = 2.23E-13, P-value = 3.48E-03,
   P-value = 5.71E-04, and P-value = 5.16E-08, respectively). While, the
   most significantly enriched terms among down-regulated genes in
   skeletal muscle were ‘cellular component organization’, ‘cellular
   process’, ‘establishment of localization’, ‘localization’, and
   ‘metabolic process’ (P-value = 4.94E-04, P-value = 9.12E-04,
   P-value = 7.09E-02, P-value = 6.87E-02, P-value = 5.28E-02, and
   P-value = 1.89E-2, respectively). ‘Developmental process’ was the most
   significantly enriched term in skeletal muscle among both up and
   down-regulated genes. However, no terms were highlighted as being
   significantly enriched in blood. We summarized the cellular components
   and molecular function gene ontology of DUDEGs in RNA-seq data from
   skeletal muscle and blood of six Thoroughbreds before and after
   exercise ([134]Table S3 in File S1).

Figure 2. Biological process GO terms of tissues specific DEGs between before
and after exercise in Thoroughbred.

   [135]Figure 2
   [136]Open in a new tab

   Up–regulated genes indicate higher activation after exercise than
   before and down-regulation genes indicate lower activation after
   exercise than before exercise.

   Enriched KEGG pathways analysis using DUDEGs revealed that up-regulated
   genes in skeletal muscle and blood are associated with exercise-induced
   stress. The most significantly enriched terms in skeletal muscle were
   ‘p53 signaling pathway’, ‘regulation of actin cytoskeleton’, ‘JAK-STAT
   signaling pathway’, ‘MAPK signaling pathway’, ‘cell adhesion
   molecules’, ‘cytokine-cytokine receptor interaction’, ‘bladder cancer’,
   and ‘pathways in cancer’. In addition, the two terms ‘colorectal
   cancer’ and ‘biosynthesis of unsaturated fatty acids’ were
   significantly enriched in blood ([137]Table 1).

Table 1. Enriched KEGG pathways associated with DEGs in two tissue such as
skeletal muscle and blood.

   Highest KEGG Higher KEGG KEGG Blood Muscle
   UP DOWN UP DOWN
   Cellular Processes Cell growth and death Cell cycle V
   p53 signaling pathway V
   Cell motility Regulation of actin cytoskeleton V
   Environmental Information Processing Membrane transport ABC
   transporters V
   Signal transduction Jak-STAT signaling pathway V
   MAPK signaling pathway V
   Notch signaling pathway V
   Phosphatidylinositol signaling system V
   Signaling molecules and interaction Cell adhesion molecules (CAMs) V
   Cytokine-cytokine receptor interaction V
   Genetic Information Processing Folding, sorting and degradation
   Ubiquitin mediated proteolysis V
   Replication and repair Non-homologous end-joining V
   Human Diseases Cancers Colorectal cancer V
   Bladder cancer V V
   Pathways in cancer V
   Small cell lung cancer V
   Metabolism Carbohydrate metabolism Butanoate metabolism V
   Inositol phosphate metabolism V
   Glycan biosynthesis and metabolism
   Glycosylphosphatidylinositol(GPI)-anchor biosynthesis V V
   Lipid metabolism Biosynthesis of unsaturated fatty acids V
   Organismal Systems Immune system B cell receptor signaling pathway V
   Hematopoietic cell lineage V
   Natural killer cell mediated cytotoxicity V
   Toll-like receptor signaling pathway V
   T cell receptor signaling pathway V
   [138]Open in a new tab

   For each set of up-regulated and down-regulated. DEG in skeletal muscle
   and blood, a KEGG pathway enrichment analysis was performed. Starting
   from the right, the table shows: tissue type, status of regulation,
   KEGG pathway terms, higher-level KEGG pathway terms, and the highest
   level of KEGG pathway terms.

Integration of DUDEG and selected gene associated with nucleotide diversity,
F[ST] and Cross Population Extended Haplotype Homozygosity (XP-EHH)

   The genes under selection were further investigated using nucleotide
   diversity, F[ST] and XP-EHH from WGS of Thoroughbred and Jeju pony
   (skeletal muscle = 1,033 and blood = 567). We found 12 genes (ZNF592,
   CD58, C1orf162, USP37, FOXM1, TIMELESS, TRMT1, CALR, ASNA1, EIF4A3,
   SYNRG and FADS1) in blood and 14 genes (HERC2, CHD9, DDX28, CAPZA1,
   TSEN15, CHMP4C, FOXO3, PLD2, ANKRD13D, UNKL, CBFA2T2, NECAB3, SLC25A29
   and FBLN1) in skeletal muscle that were both identified as DUDEGs and
   implicated in F[ST] analysis as a selected gene ([139]Table S4 in File
   S1). The F[ST] distribution histogram for the pair of horse breeds is
   shown in [140]Figure S7 in File S1.

   Between DUDEGs and XP-EHH (selected genes in the Thoroughbred), we
   found 11 (ANK3, ZNF592, TOR1AIP1, TIMELESS, INSR, MED15, ZNF567,
   EIF4A3, EXOC6B, PPP4R2 and DYRK1A) and 48 common genes (URB2, NEO1,
   CORO2B, CPE, FUK, SYCE1L, TEC, RELL1, ZNF775, RGS4, FAM46C, PTGFRN,
   DENND2C, CTTNBP2NL, VAV1, BARX2, RAB31, VPS4B, CHMP4C, DEPTOR, ATG5,
   FOXO3, OSBPL7, PEX12, PIK3R5, PITX1, MARCH3, LMNB1, ST8SIA4, RASGRF2,
   ARSB, INPP4A, RNF144A, PPP4R2, FRMD4B, SPATA13, SLC7A1, CAB39L,
   B3GALT1, MECOM, PARP14, NPR3, TGM3, DHX35, AUH, C14orf102, COL27A1 and
   HLCS) in blood and skeletal muscle, respectively ([141]Table S5 in File
   S1). Among them, three genes, TIMELESS, EIF4A3 and ZNF592, in blood,
   and two genes, CHMP4C and FOXO3, in skeletal muscle were shown to be
   significant in all three analyses (DUDEG, F[ST], and XP-EHH). In
   comparison to Jeju pony, the Thoroughbred showed relatively low levels
   of nucleotide diversity at three out of the five identified genes
   ([142]Table 2 and [143]Figure 3).

Table 2. Co-matching genes between the DEGs, selected genes associated with
F[st] (F[st] cut-off value top 5% with empirical p-value<0.05) and
Thoroughbred selected genes associated with XP-EHH (XP-EHH cut-off value
empirical p-value<0.01 and XP-EHH value <−3.51551 significant SNPs).

   Sample Tissue Ens ID CHR Start End DEG logFC DEG P-value DEG FDR Gene
   symbol Reynolds Fst SNP region XP-EHH XP-EHH P-value
   Blood ENSECAG00000015925 11 2814986 2828951 −17.23 1.81E-34 2.94E-31
   EIF4A3 (eukaryotic translation initiation factor 4A3) 0.83481 0.56604
   2821795 −3.52526 9.73E-03
   ENSECAG00000016283 1 92397477 92416004 −14.5988 2.23E-26 5.06E-24
   ZNF592 (zinc finger protein 592) 0.67059 0.48859 92387687 −3.55925
   8.83E-03
   92397597 −3.78677 4.48E-03
   92398170 −3.74169 5.15E-03
   92398472 −3.78551 4.50E-03
   92398776 −3.83657 3.84E-03
   92398791 −4.18477 1.23E-03
   92398828 −3.56243 8.74E-03
   92398986 −3.5556 8.92E-03
   92399035 −3.76415 4.81E-03
   92402102 −3.60557 7.72E-03
   92402157 −3.58028 8.31E-03
   92402965 −3.58305 8.24E-03
   92403116 −3.61357 7.54E-03
   92403392 −3.55258 9.00E-03
   ENSECAG00000002892 6 74016170 74027809 9.80616 1.34E-12 6.75E-11
   TIMELESS (timeless circadian clock) 0.67349 0.49007 74019099 −4.02647
   2.09E-03
   74019114 −4.05026 1.93E-03
   74019135 −4.28442 8.68E-04
   74019141 −4.56193 3.15E-04
   74019149 −4.42328 5.27E-04
   74019152 −4.43649 5.02E-04
   74019153 −4.81799 1.16E-04
   74019241 −4.96438 6.40E-05
   Muscle ENSECAG00000022647 9 6229159 6251198 12.97629 5.96E-22 7.12E-20
   CHMP4C (charged multivesicular body protein 4C) 1.29183 0.72523 6257522
   −3.56699 8.63E-03
   ENSECAG00000024499 10 58506453 58523225 1.941151 0.001093 0.008744
   FOXO3 (forkhead box O3) 0.91477 0.59939 58521033 −3.5866 8.15E-03
   58521300 −3.67329 6.32E-03
   [144]Open in a new tab

Figure 3. Signatures of correlation between DEGs from Thoroughbred RNA-seq
and selected genes associated with nucleotide diversity, F[ST] and XP-EHH
from Thoroughbred and Jeju pony DNA sequence.

   [145]Figure 3
   [146]Open in a new tab

   a) Manhattan plot of F[ST] (Dotted line = cut-off value of the top 5%
   with empirical p-values of <0.05. Red point = common genes between the
   DEGs, Thoroughbred selected genes associated with F[ST] and XP-EHH). b)
   Manhattan plot of XP-EHH value (Red point = Common genes between the
   DEGs and Thoroughbred selected genes associated with F[ST] and XP-EHH).
   c) Nucleotide diversity line plot of five common genes (sky blue color
   line = Jeju pony, orange color line = Thoroughbred).

Discussion

Differences in result between reference-based and de novo-based assembly and
analysis

   Up to now, many studies of RNA-seq analyses have used reference-based
   analysis (RBA) when a reference genome for the species is available
   [147][27], [148][46], [149][47]. When a species does not have a
   reference genome, RBA using the reference genome of a closely related
   species or de novo-based analysis (DBA) is used. Several studies of
   RNA-seq analyses have used the align-then-assemble combined method
   (align-then-assemble and assemble-then-align) [150][24]. DBA used in
   align-then-assemble method assembles the unmapped sequence reads after
   RBA, which supplements the inherent weakness of RBA. However, we
   suggest that even if a species has a reference genome, DBA used in
   assemble-then-align is necessary for assembly of the total sequence
   reads including the unmapped sequence reads ([151]Table S6 and S7 in
   File S1). In this study, we found a significant difference in the
   number of transcript and differentially expressed genes (DEGs)
   identified by RBA and DBA. Also, a greater number of unique transcripts
   were identified by DBA than in RBA ([152]Figure 1a and [153]Figure S1a
   in File S1). This implies that the horse RNA-seq data in this study
   includes new transcripts and new transcript structures that are not
   included in the horse reference genome. Additionally, we found de novo
   unique differentially expressed genes (DUDEGs) which cannot be attained
   from RBA ([154]Figure 1c and [155]Figure S1c in File S1). Credibility
   of DBA in RNA-seq has been proven in numerous methods and protocol
   papers [156][17], [157][25], [158][48] and in this study. Both
   multidimensional scaling (MDS) plot ([159]Figure 1b, [160]Figure S1b
   and Figure S2 in File S1) and expression patterns of common DEGs did
   not show differences between the results of RBA and DBA. However, the
   intensity of the expression was different because the newly assembled
   transcriptome reference represents the individual transcriptome made
   using de Bruijn graph assemblers during DBA [161][48] ([162]Figure 1d
   and [163]Figure S1d in File S1).

   RNA-Seq can reveal sequence variations such as SNP in genes [164][49]
   as is possible with whole genome sequencing (WGS). Transcribed SNPs in
   RNA-seq are needed for accurate measurement of allele-specific
   expression [165][50], [166][51] and detection of novel SNPs. [167][52].
   Hence, we compared the number and rate of SNPs identified from the two
   NGS methods and found differences between the type of references and